
[clickhouseexporter]incorrect values of exemplar span id and trace id #31332

Closed
cezhang opened this issue Feb 20, 2024 · 20 comments
Labels: bug (Something isn't working), exporter/clickhouse

cezhang commented Feb 20, 2024

Component(s)

exporter/clickhouse

What happened?

Description

[screenshot]

The spanId and traceId values are written into each other's columns.

Steps to Reproduce

In my case, I just used the Java agent for auto-instrumentation.

I checked the source code of this exporter and it looks fine, so the problem is probably somewhere in the exemplar data handling.

Expected Result

The stored span ID and trace ID values should be swapped back, i.e. each value should appear in its correct column.

Actual Result

Collector version

0.9.2

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:
    timeout: 15s
    send_batch_size: 100000
exporters:
  debug:
    verbosity: detailed
  clickhouse:
    endpoint: tcp://proxy.o11y.svc.cluster.local:9000?dial_timeout=10s&compress=lz4
    database: otel
    ttl: 72h
    logs_table_name: otel_logs
    traces_table_name: otel_traces
    metrics_table_name: otel_metrics
    timeout: 5s
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
  prometheus:
    endpoint: ':8889'
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [clickhouse]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [clickhouse]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [clickhouse, prometheus]

Log output

No response

Additional context

No response

cezhang added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Feb 20, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@crobert-1 (Member)

Hello @cezhang, I took a quick look and couldn't find anything obviously wrong either. Could you add the debug exporter to your traces pipeline, and share an example of the output of the debug exporter compared to the ClickHouse backend's data? This would help us confirm the issue is somewhere in the ClickHouse exporter, and not something else.

@saitonakamura

I'm experiencing the same problem with LogAttributes.
[screenshot]

The debug exporter logs correlate with the stored data; the attributes are just sorted differently (maybe not randomly, but in a different order) from record to record.

otelcol-1     | 2024-02-29T12:58:53.598Z        info    LogsExporter    {"kind": "exporter", "data_type": "logs", "name": "debug", "resource logs": 1, "log records": 2}
otelcol-1     | 2024-02-29T12:58:53.598Z        info    ResourceLog #0
otelcol-1     | Resource SchemaURL: 
otelcol-1     | ScopeLogs #0
otelcol-1     | ScopeLogs SchemaURL: 
otelcol-1     | InstrumentationScope  
otelcol-1     | LogRecord #0
otelcol-1     | ObservedTimestamp: 2024-02-29 12:58:52.807378838 +0000 UTC
otelcol-1     | Timestamp: 2024-02-26 11:47:38 +0000 UTC
otelcol-1     | SeverityText: 
otelcol-1     | SeverityNumber: Unspecified(0)
otelcol-1     | Body: Empty()
otelcol-1     | Attributes:
otelcol-1     |      -> log.file.name: Str(json2.log)
otelcol-1     |      -> event: Str(videoMetrics)
otelcol-1     |      -> device: Str(Meta Quest 3)
otelcol-1     |      -> counter: Double(1)
otelcol-1     | Trace ID: 
otelcol-1     | Span ID: 
otelcol-1     | Flags: 0
otelcol-1     | LogRecord #1
otelcol-1     | ObservedTimestamp: 2024-02-29 12:58:52.807827588 +0000 UTC
otelcol-1     | Timestamp: 2024-02-26 11:13:48 +0000 UTC
otelcol-1     | SeverityText: 
otelcol-1     | SeverityNumber: Unspecified(0)
otelcol-1     | Body: Empty()
otelcol-1     | Attributes:
otelcol-1     |      -> event: Str(videoMetrics)
otelcol-1     |      -> log.file.name: Str(json2.log)
otelcol-1     |      -> device: Str(Meta Quest 3)
otelcol-1     |      -> counter: Double(2)
otelcol-1     | Trace ID: 
otelcol-1     | Span ID: 
otelcol-1     | Flags: 0
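
A side note on the ordering itself: keys in a map carry no positional meaning, so two records with identical data can list their attributes in different orders, and only lookups by key are reliable. Below is a minimal Go sketch of that point (a general illustration, not code from the exporter; the attribute values are just copied from the log output above):

package main

import "fmt"

func main() {
	// A record's attributes, keyed by name.
	attrs := map[string]string{
		"log.file.name": "json2.log",
		"event":         "videoMetrics",
		"device":        "Meta Quest 3",
		"counter":       "1",
	}

	// Iteration order over a Go map is deliberately unspecified, so this
	// loop can print the keys in a different order on every run, much
	// like the differently "sorted" attributes in the rows above.
	for k, v := range attrs {
		fmt.Printf("  -> %s: %s\n", k, v)
	}

	// A lookup by key is stable no matter how the keys happen to be ordered.
	fmt.Println("counter =", attrs["counter"])
}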


saitonakamura commented Feb 29, 2024

I think it might be a problem with ClickHouse itself; see this example. From

INSERT INTO table_map VALUES ({'key2':1, 'key1':10}), ({'key1':2,'key2':20}), ({'key1':3,'key2':30});

I would expect ClickHouse to handle the keys correctly and return

key2 | key1
   1 |   10
  20 |    2
  30 |    3

but that's not the case:

[screenshot]

I don't see anything about the positionality (is that even a word?) of map keys in the docs, but I didn't search extensively:
https://clickhouse.com/docs/en/sql-reference/data-types/map

Reported in ClickHouse/ClickHouse#60571

@crobert-1 (Member)

Thanks for the investigation @saitonakamura, very helpful! I'll keep this issue open for now, and we'll proceed depending on the result of the ClickHouse issue you've opened.

crobert-1 removed the needs triage (New item requiring triage) label on Feb 29, 2024

crobert-1 commented Feb 29, 2024

Okay, it looks like this is possibly a DBeaver issue. @cezhang, can you confirm you were using DBeaver? If so, we can close the issue here.

@saitonakamura

I think the screenshot is from DataGrip, btw.

@saitonakamura

I'm going to try to reproduce it again with otel and see whether the issue is only with DBeaver and not with the clickhouse client.


saitonakamura commented Feb 29, 2024

I can confirm that clickhouse client reports the correct data

❯ echo 'SELECT * FROM otel_logs;' | curl 'http://localhost:8123/' --data-binary @-
2024-02-26 11:17:14.000000000    0    0    {}    {}    {'log.file.name':'json2.log','event':'videoMetrics','device':'Meta Quest 3','counter':'2'}
2024-02-26 11:47:38.000000000    0    0    {}    {}    {'event':'videoMetrics','log.file.name':'json2.log','device':'Meta Quest 3','counter':'1'}

The issue might be the same for DataGrip.

@den-crane

you can query keys explicitly: a['key1']

select a['key1'] k1, a['key2'] k2, a['key3'] k3 from table_map

https://fiddle.clickhouse.com/38ff4d0f-5840-4f90-baf7-3d45ba6f63ea


cezhang commented Mar 1, 2024

@crobert-1 I do use DataGrip, but it doesn't seem to be related to that in this case.
I collect signals from two different k8s clusters into the same ClickHouse instance. One seems to be normal, but the other does not.
I tested the query in both the clickhouse client and DataGrip, with the same results. Also, the discussion above is about the Map type, but in my case Exemplar is a nested structure.

[screenshots]


cezhang commented Mar 1, 2024

From the debug exporter:

...
Exemplars:
Exemplar #0
     -> Trace ID: 42544a71d7d6a80668aca81d7d876fd7
     -> Span ID: e263867bcf0e924f
     -> Timestamp: 2024-03-01 07:54:47.995 +0000 UTC
     -> Value: 9.258925
     -> FilteredAttributes:
          -> http.target: Str(/loggers)
          -> net.sock.host.addr: Str(127.0.0.1)
          -> net.sock.host.port: Int(8080)
          -> net.sock.peer.addr: Str(127.0.0.1)
          -> net.sock.peer.port: Int(57610)
          -> user_agent.original: Str(Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36)
HistogramDataPoints #1
Data point attributes:
     -> http.method: Str(GET)
     -> http.route: Str(/)
     -> http.scheme: Str(http)
     -> http.status_code: Int(200)
     -> net.host.name: Str(localhost)
     -> net.host.port: Int(8080)
     -> net.protocol.name: Str(http)
     -> net.protocol.version: Str(1.1)

otel_logs: [screenshot]
otel_traces_trace_id_ts: [screenshot]
otel_traces: [screenshot]
otel_metrics_histogram: [screenshot]

clickhouse client: [screenshot]

@crobert-1 (Member)

@cezhang So from your last comment, it looks like the only error is in the otel_metrics_histogram table, is that correct?

It looks like this logic is actually in the metrics pipeline, and the ClickHouse DB create and insert commands are in this file. It still looks correct to me, though; I'm not seeing how they could be exported in the wrong order.


cezhang commented Mar 5, 2024

> @cezhang So from your last comment, it looks like the only error is in the otel_metrics_histogram table, is that correct?
>
> It looks like this logic is actually in the metrics pipeline, and the ClickHouse DB create and insert commands are in this file. It still looks correct to me, though; I'm not seeing how they could be exported in the wrong order.

So far it only happens in otel_metrics_histogram in my case.
The file you mentioned looks good to me too.

@crobert-1 (Member)

@hanjm @dmitryax @Frapschen: Any idea how traceID and spanID could get switched around for the otel_metrics_histogram table?

@den-crane

This line looks suspicious: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/clickhouseexporter/internal/histogram_metrics.go#L121C68-L121C77

attrs, times, values, traceIDs, spanIDs := convertExemplars(dp.Exemplars())

In all other places the order is different: spanIDs, traceIDs

@crobert-1 (Member)

> This line looks suspicious: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/clickhouseexporter/internal/histogram_metrics.go#L121C68-L121C77
>
> attrs, times, values, traceIDs, spanIDs := convertExemplars(dp.Exemplars())
>
> In all other places the order is different: spanIDs, traceIDs

That caught my eye as well, but it looks like that's just the return order of the method being called (source).
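
For anyone skimming the thread, the failure mode being discussed here is easy to show in isolation. The following is a hypothetical Go sketch, not the exporter's actual code; the function name and ID values are only borrowed from the comments above. When a function returns several slices of the same type and a caller binds them in a different order, the program still compiles and runs, but the values land in each other's columns.

package main

import "fmt"

// convertExemplars returns parallel slices; note the order: span IDs first,
// then trace IDs. (Hypothetical signature, for illustration only.)
func convertExemplars() (spanIDs, traceIDs []string) {
	return []string{"e263867bcf0e924f"},
		[]string{"42544a71d7d6a80668aca81d7d876fd7"}
}

func main() {
	// A caller that assumes the opposite order still compiles, because both
	// return values have the same type, but the data is silently swapped.
	traceIDs, spanIDs := convertExemplars()

	// Whatever columns these slices are written to now hold each other's values.
	fmt.Println("Exemplars.TraceId:", traceIDs) // actually the span IDs
	fmt.Println("Exemplars.SpanId:", spanIDs)   // actually the trace IDs
}

Because every call site has to agree with the callee's return order by convention alone, a single mismatched call site can affect one table (here the histogram one) while the others stay correct.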

@Frapschen (Contributor)

@cezhang can you double-check both k8s clusters' otel collector versions? This bug was fixed in https://github.com/open-telemetry/opentelemetry-collector-contrib/releases/tag/v0.93.0


cezhang commented Mar 13, 2024

> @cezhang can you double-check both k8s clusters' otel collector versions? This bug was fixed in https://github.com/open-telemetry/opentelemetry-collector-contrib/releases/tag/v0.93.0

OMG, mine is 0.92. That is the cause. Thank you!

Frapschen self-assigned this on Mar 13, 2024

crobert-1 commented Mar 13, 2024

#30210 resolved this.

🤦 That makes so much more sense, thanks so much for the pointer @Frapschen!
