[prometheusreceiver] - Critical regression since 0.30.0 #4907

Closed
gillg opened this issue Aug 27, 2021 · 8 comments
Labels
bug (Something isn't working) · comp:prometheus (Prometheus related issues)

Comments

gillg (Contributor) commented Aug 27, 2021

Describe the bug
Hello, I just discovered a regression since the 0.30.0 release.
Many of my metrics have the value "NaN" since version 0.30.0! (I tested many versions one by one...)

The problem is probably introduced somewhere in this commit range: open-telemetry/opentelemetry-collector@e8aeafa...79816e7

Workflow:
node exporter => prometheus receiver => batch processor => prometheus exporter

Steps to reproduce
Run a standard node exporter, scrape it with the prometheus receiver, expose the result with the prometheus exporter, and look at the content.

What did you expect to see?

# HELP node_xfs_vnode_get_total Number of times vn_get called for a filesystem.
# TYPE node_xfs_vnode_get_total counter
node_xfs_vnode_get_total{device="nvme0n1p1",otel_job="node-exporter"} 0 1630058901962
# HELP node_xfs_vnode_hold_total Number of times vn_hold called for a filesystem.
# TYPE node_xfs_vnode_hold_total counter
node_xfs_vnode_hold_total{device="nvme0n1p1",otel_job="node-exporter"} 0 1630058901962
# HELP node_xfs_vnode_reclaim_total Number of times vn_reclaim called for a filesystem.
# TYPE node_xfs_vnode_reclaim_total counter
node_xfs_vnode_reclaim_total{device="nvme0n1p1",otel_job="node-exporter"} 90322 1630058901962
# HELP node_xfs_vnode_release_total Number of times vn_rele called for a filesystem.
# TYPE node_xfs_vnode_release_total counter
node_xfs_vnode_release_total{device="nvme0n1p1",otel_job="node-exporter"} 90322 1630058901962
# HELP node_xfs_vnode_remove_total Number of times vn_remove called for a filesystem.
# TYPE node_xfs_vnode_remove_total counter
node_xfs_vnode_remove_total{device="nvme0n1p1",otel_job="node-exporter"} 90322 1630058901962
# HELP node_xfs_write_calls_total Number of write(2) system calls made to files in a filesystem.
# TYPE node_xfs_write_calls_total counter
node_xfs_write_calls_total{device="nvme0n1p1",otel_job="node-exporter"} 6.6463465e+07 1630058901962
# HELP otelcol_exporter_queue_size Current size of the retry queue (in batches)
# TYPE otelcol_exporter_queue_size gauge
otelcol_exporter_queue_size{exporter="otlp",otel_job="otel-collector",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c"} 0 1630058899242
# HELP otelcol_exporter_send_failed_metric_points Number of metric points in failed attempts to send to destination.
# TYPE otelcol_exporter_send_failed_metric_points counter
otelcol_exporter_send_failed_metric_points{exporter="prometheus",otel_job="otel-collector",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c"} 0 1630058899242
# HELP otelcol_exporter_sent_metric_points Number of metric points successfully sent to destination.
# TYPE otelcol_exporter_sent_metric_points counter
otelcol_exporter_sent_metric_points{exporter="prometheus",otel_job="otel-collector",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c"} 1292 1630058899242
# HELP otelcol_process_cpu_seconds Total CPU user and system time in seconds
# TYPE otelcol_process_cpu_seconds gauge
otelcol_process_cpu_seconds{otel_job="otel-collector",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c"} 0.9199999999999999 1630058899242
# HELP otelcol_process_memory_rss Total physical memory (resident set size)
# TYPE otelcol_process_memory_rss gauge
otelcol_process_memory_rss{otel_job="otel-collector",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c"} 7.2527872e+07 1630058899242
# HELP otelcol_process_runtime_heap_alloc_bytes Bytes of allocated heap objects (see 'go doc runtime.MemStats.HeapAlloc')
# TYPE otelcol_process_runtime_heap_alloc_bytes gauge
otelcol_process_runtime_heap_alloc_bytes{otel_job="otel-collector",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c"} 1.2060968e+07 1630058899242
# HELP otelcol_process_runtime_total_alloc_bytes Cumulative bytes allocated for heap objects (see 'go doc runtime.MemStats.TotalAlloc')
# TYPE otelcol_process_runtime_total_alloc_bytes gauge
otelcol_process_runtime_total_alloc_bytes{otel_job="otel-collector",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c"} 2.6255504e+07 1630058899242
# HELP otelcol_process_runtime_total_sys_memory_bytes Total bytes of memory obtained from the OS (see 'go doc runtime.MemStats.Sys')
# TYPE otelcol_process_runtime_total_sys_memory_bytes gauge
otelcol_process_runtime_total_sys_memory_bytes{otel_job="otel-collector",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c"} 7.531828e+07 1630058899242
# HELP otelcol_process_uptime Uptime of the process
# TYPE otelcol_process_uptime counter
otelcol_process_uptime{otel_job="otel-collector",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c"} 15.033508176 1630058899242
# HELP otelcol_processor_accepted_metric_points Number of metric points successfully pushed into the next component in the pipeline.
# TYPE otelcol_processor_accepted_metric_points counter
otelcol_processor_accepted_metric_points{otel_job="otel-collector",processor="memory_limiter",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c"} 646 1630058899242
# HELP otelcol_processor_batch_batch_send_size Number of units in the batch
# TYPE otelcol_processor_batch_batch_send_size histogram
otelcol_processor_batch_batch_send_size_bucket{otel_job="otel-collector",processor="batch",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c",le="10"} 0 1630058899242
otelcol_processor_batch_batch_send_size_bucket{otel_job="otel-collector",processor="batch",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c",le="25"} 1 1630058899242
otelcol_processor_batch_batch_send_size_bucket{otel_job="otel-collector",processor="batch",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c",le="50"} 1 1630058899242
otelcol_processor_batch_batch_send_size_bucket{otel_job="otel-collector",processor="batch",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c",le="75"} 1 1630058899242
otelcol_processor_batch_batch_send_size_bucket{otel_job="otel-collector",processor="batch",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c",le="100"} 1 1630058899242
otelcol_processor_batch_batch_send_size_bucket{otel_job="otel-collector",processor="batch",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c",le="250"} 1 1630058899242
otelcol_processor_batch_batch_send_size_bucket{otel_job="otel-collector",processor="batch",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c",le="500"} 1 1630058899242
otelcol_processor_batch_batch_send_size_bucket{otel_job="otel-collector",processor="batch",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c",le="750"} 2 1630058899242
otelcol_processor_batch_batch_send_size_bucket{otel_job="otel-collector",processor="batch",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c",le="1000"} 2 1630058899242
otelcol_processor_batch_batch_send_size_bucket{otel_job="otel-collector",processor="batch",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c",le="2000"} 2 1630058899242
otelcol_processor_batch_batch_send_size_bucket{otel_job="otel-collector",processor="batch",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c",le="3000"} 2 1630058899242
otelcol_processor_batch_batch_send_size_bucket{otel_job="otel-collector",processor="batch",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c",le="4000"} 2 1630058899242
otelcol_processor_batch_batch_send_size_bucket{otel_job="otel-collector",processor="batch",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c",le="5000"} 2 1630058899242
otelcol_processor_batch_batch_send_size_bucket{otel_job="otel-collector",processor="batch",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c",le="6000"} 2 1630058899242
otelcol_processor_batch_batch_send_size_bucket{otel_job="otel-collector",processor="batch",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c",le="7000"} 2 1630058899242
otelcol_processor_batch_batch_send_size_bucket{otel_job="otel-collector",processor="batch",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c",le="8000"} 2 1630058899242
otelcol_processor_batch_batch_send_size_bucket{otel_job="otel-collector",processor="batch",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c",le="9000"} 2 1630058899242
otelcol_processor_batch_batch_send_size_bucket{otel_job="otel-collector",processor="batch",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c",le="10000"} 2 1630058899242
otelcol_processor_batch_batch_send_size_bucket{otel_job="otel-collector",processor="batch",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c",le="20000"} 2 1630058899242
otelcol_processor_batch_batch_send_size_bucket{otel_job="otel-collector",processor="batch",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c",le="30000"} 2 1630058899242
otelcol_processor_batch_batch_send_size_bucket{otel_job="otel-collector",processor="batch",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c",le="50000"} 2 1630058899242
otelcol_processor_batch_batch_send_size_bucket{otel_job="otel-collector",processor="batch",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c",le="100000"} 2 1630058899242
otelcol_processor_batch_batch_send_size_bucket{otel_job="otel-collector",processor="batch",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c",le="+Inf"} 2 1630058899242
otelcol_processor_batch_batch_send_size_sum{otel_job="otel-collector",processor="batch",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c"} 646 1630058899242
otelcol_processor_batch_batch_send_size_count{otel_job="otel-collector",processor="batch",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c"} 2 1630058899242
# HELP otelcol_processor_batch_batch_size_trigger_send Number of times the batch was sent due to a size trigger
# TYPE otelcol_processor_batch_batch_size_trigger_send counter
otelcol_processor_batch_batch_size_trigger_send{otel_job="otel-collector",processor="batch",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c"} 1 1630058899242
# HELP otelcol_processor_batch_timeout_trigger_send Number of times the batch was sent due to a timeout trigger
# TYPE otelcol_processor_batch_timeout_trigger_send counter
otelcol_processor_batch_timeout_trigger_send{otel_job="otel-collector",processor="batch",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c"} 1 1630058899242
# HELP otelcol_processor_dropped_metric_points Number of metric points that were dropped.
# TYPE otelcol_processor_dropped_metric_points counter
otelcol_processor_dropped_metric_points{otel_job="otel-collector",processor="memory_limiter",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c"} 0 1630058899242
# HELP otelcol_processor_refused_metric_points Number of metric points that were rejected by the next component in the pipeline.
# TYPE otelcol_processor_refused_metric_points counter
otelcol_processor_refused_metric_points{otel_job="otel-collector",processor="memory_limiter",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c"} 0 1630058899242
# HELP otelcol_receiver_accepted_metric_points Number of metric points successfully pushed into the pipeline.
# TYPE otelcol_receiver_accepted_metric_points counter
otelcol_receiver_accepted_metric_points{otel_job="otel-collector",receiver="prometheus",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c",transport="http"} 646 1630058899242
# HELP otelcol_receiver_refused_metric_points Number of metric points that could not be pushed into the pipeline.
# TYPE otelcol_receiver_refused_metric_points counter
otelcol_receiver_refused_metric_points{otel_job="otel-collector",receiver="prometheus",service_instance_id="2fb7e09f-d61f-4668-a942-31b37388c80c",transport="http"} 0 1630058899242
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total{otel_job="node-exporter"} 982.09 1630058901962
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds{otel_job="node-exporter"} 65535 1630058901962
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds{otel_job="node-exporter"} 9 1630058901962
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes{otel_job="node-exporter"} 1.5151104e+07 1630058901962
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds{otel_job="node-exporter"} 1.62981915151e+09 1630058901962
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes{otel_job="node-exporter"} 7.36108544e+08 1630058901962
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes{otel_job="node-exporter"} 1.8446744073709552e+19 1630058901962
# HELP promhttp_metric_handler_errors_total Total number of internal errors encountered by the promhttp metric handler.
# TYPE promhttp_metric_handler_errors_total counter
promhttp_metric_handler_errors_total{cause="encoding",otel_job="node-exporter"} 0 1630058901962
promhttp_metric_handler_errors_total{cause="gathering",otel_job="node-exporter"} 0 1630058901962
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight{otel_job="node-exporter"} 1 1630058901962
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200",otel_job="node-exporter"} 22652 1630058901962
promhttp_metric_handler_requests_total{code="500",otel_job="node-exporter"} 0 1630058901962
promhttp_metric_handler_requests_total{code="503",otel_job="node-exporter"} 0 1630058901962
# HELP scrape_duration_seconds Duration of the scrape
# TYPE scrape_duration_seconds gauge
scrape_duration_seconds{otel_job="node-exporter"} 0.032088469 1630058901962
scrape_duration_seconds{otel_job="otel-collector"} 0.002823924 1630058899242
# HELP scrape_samples_post_metric_relabeling The number of samples remaining after metric relabeling was applied
# TYPE scrape_samples_post_metric_relabeling gauge
scrape_samples_post_metric_relabeling{otel_job="node-exporter"} 635 1630058901962
scrape_samples_post_metric_relabeling{otel_job="otel-collector"} 41 1630058899242
# HELP scrape_samples_scraped The number of samples the target exposed
# TYPE scrape_samples_scraped gauge
scrape_samples_scraped{otel_job="node-exporter"} 635 1630058901962
scrape_samples_scraped{otel_job="otel-collector"} 41 1630058899242
# HELP scrape_series_added The approximate number of new series in this scrape
# TYPE scrape_series_added gauge
scrape_series_added{otel_job="node-exporter"} 635 1630058901962
scrape_series_added{otel_job="otel-collector"} 41 1630058899242
# HELP up The scraping was successful
# TYPE up gauge
up{otel_job="node-exporter"} 1 1630058901962
up{otel_job="otel-collector"} 1 1630058899242

What did you see instead?

# HELP node_xfs_vnode_get_total 
# TYPE node_xfs_vnode_get_total gauge
node_xfs_vnode_get_total{device="nvme0n1p1",otel_job="node-exporter"} NaN 1630059101854
# HELP node_xfs_vnode_hold_total Number of times vn_hold called for a filesystem.
# TYPE node_xfs_vnode_hold_total counter
node_xfs_vnode_hold_total{device="nvme0n1p1",otel_job="node-exporter"} 0 1630059101854
# HELP node_xfs_vnode_reclaim_total Number of times vn_reclaim called for a filesystem.
# TYPE node_xfs_vnode_reclaim_total counter
node_xfs_vnode_reclaim_total{device="nvme0n1p1",otel_job="node-exporter"} 90433 1630059101854
# HELP node_xfs_vnode_release_total Number of times vn_rele called for a filesystem.
# TYPE node_xfs_vnode_release_total counter
node_xfs_vnode_release_total{device="nvme0n1p1",otel_job="node-exporter"} 90433 1630059101854
# HELP node_xfs_vnode_remove_total 
# TYPE node_xfs_vnode_remove_total gauge
node_xfs_vnode_remove_total{device="nvme0n1p1",otel_job="node-exporter"} NaN 1630059101854
# HELP node_xfs_write_calls_total 
# TYPE node_xfs_write_calls_total gauge
node_xfs_write_calls_total{device="nvme0n1p1",otel_job="node-exporter"} NaN 1630059101854
# HELP otelcol_exporter_queue_size Current size of the retry queue (in batches)
# TYPE otelcol_exporter_queue_size gauge
otelcol_exporter_queue_size{exporter="otlp",otel_job="otel-collector",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} 0 1630059107194
# HELP otelcol_exporter_send_failed_metric_points Number of metric points in failed attempts to send to destination.
# TYPE otelcol_exporter_send_failed_metric_points counter
otelcol_exporter_send_failed_metric_points{exporter="prometheus",otel_job="otel-collector",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} 0 1630059107194
# HELP otelcol_exporter_sent_metric_points Number of metric points successfully sent to destination.
# TYPE otelcol_exporter_sent_metric_points counter
otelcol_exporter_sent_metric_points{exporter="prometheus",otel_job="otel-collector",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} 4966 1630059107194
# HELP otelcol_process_cpu_seconds Total CPU user and system time in seconds
# TYPE otelcol_process_cpu_seconds gauge
otelcol_process_cpu_seconds{otel_job="otel-collector",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} 1.69 1630059107194
# HELP otelcol_process_memory_rss Total physical memory (resident set size)
# TYPE otelcol_process_memory_rss gauge
otelcol_process_memory_rss{otel_job="otel-collector",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} 8.2448384e+07 1630059107194
# HELP otelcol_process_runtime_heap_alloc_bytes Bytes of allocated heap objects (see 'go doc runtime.MemStats.HeapAlloc')
# TYPE otelcol_process_runtime_heap_alloc_bytes gauge
otelcol_process_runtime_heap_alloc_bytes{otel_job="otel-collector",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} 2.0640088e+07 1630059107194
# HELP otelcol_process_runtime_total_alloc_bytes Cumulative bytes allocated for heap objects (see 'go doc runtime.MemStats.TotalAlloc')
# TYPE otelcol_process_runtime_total_alloc_bytes gauge
otelcol_process_runtime_total_alloc_bytes{otel_job="otel-collector",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} 4.5628168e+07 1630059107194
# HELP otelcol_process_runtime_total_sys_memory_bytes Total bytes of memory obtained from the OS (see 'go doc runtime.MemStats.Sys')
# TYPE otelcol_process_runtime_total_sys_memory_bytes gauge
otelcol_process_runtime_total_sys_memory_bytes{otel_job="otel-collector",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} 7.5580424e+07 1630059107194
# HELP otelcol_process_uptime Uptime of the process
# TYPE otelcol_process_uptime counter
otelcol_process_uptime{otel_job="otel-collector",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} 30.004101565 1630059107194
# HELP otelcol_processor_accepted_metric_points Number of metric points successfully pushed into the next component in the pipeline.
# TYPE otelcol_processor_accepted_metric_points counter
otelcol_processor_accepted_metric_points{otel_job="otel-collector",processor="memory_limiter",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} 2483 1630059107194
# HELP otelcol_processor_batch_batch_send_size_bucket 
# TYPE otelcol_processor_batch_batch_send_size_bucket gauge
otelcol_processor_batch_batch_send_size_bucket{le="+Inf",otel_job="otel-collector",processor="batch",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} NaN 1630059097194
otelcol_processor_batch_batch_send_size_bucket{le="10",otel_job="otel-collector",processor="batch",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} NaN 1630059097194
otelcol_processor_batch_batch_send_size_bucket{le="100",otel_job="otel-collector",processor="batch",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} NaN 1630059097194
otelcol_processor_batch_batch_send_size_bucket{le="1000",otel_job="otel-collector",processor="batch",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} NaN 1630059097194
otelcol_processor_batch_batch_send_size_bucket{le="10000",otel_job="otel-collector",processor="batch",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} NaN 1630059097194
otelcol_processor_batch_batch_send_size_bucket{le="100000",otel_job="otel-collector",processor="batch",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} NaN 1630059097194
otelcol_processor_batch_batch_send_size_bucket{le="2000",otel_job="otel-collector",processor="batch",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} NaN 1630059097194
otelcol_processor_batch_batch_send_size_bucket{le="20000",otel_job="otel-collector",processor="batch",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} NaN 1630059097194
otelcol_processor_batch_batch_send_size_bucket{le="25",otel_job="otel-collector",processor="batch",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} NaN 1630059097194
otelcol_processor_batch_batch_send_size_bucket{le="250",otel_job="otel-collector",processor="batch",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} NaN 1630059097194
otelcol_processor_batch_batch_send_size_bucket{le="3000",otel_job="otel-collector",processor="batch",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} NaN 1630059097194
otelcol_processor_batch_batch_send_size_bucket{le="30000",otel_job="otel-collector",processor="batch",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} NaN 1630059097194
otelcol_processor_batch_batch_send_size_bucket{le="4000",otel_job="otel-collector",processor="batch",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} NaN 1630059097194
otelcol_processor_batch_batch_send_size_bucket{le="50",otel_job="otel-collector",processor="batch",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} NaN 1630059097194
otelcol_processor_batch_batch_send_size_bucket{le="500",otel_job="otel-collector",processor="batch",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} NaN 1630059097194
otelcol_processor_batch_batch_send_size_bucket{le="5000",otel_job="otel-collector",processor="batch",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} NaN 1630059097194
otelcol_processor_batch_batch_send_size_bucket{le="50000",otel_job="otel-collector",processor="batch",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} NaN 1630059097194
otelcol_processor_batch_batch_send_size_bucket{le="6000",otel_job="otel-collector",processor="batch",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} NaN 1630059097194
otelcol_processor_batch_batch_send_size_bucket{le="7000",otel_job="otel-collector",processor="batch",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} NaN 1630059097194
otelcol_processor_batch_batch_send_size_bucket{le="75",otel_job="otel-collector",processor="batch",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} NaN 1630059097194
otelcol_processor_batch_batch_send_size_bucket{le="750",otel_job="otel-collector",processor="batch",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} NaN 1630059097194
otelcol_processor_batch_batch_send_size_bucket{le="8000",otel_job="otel-collector",processor="batch",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} NaN 1630059097194
otelcol_processor_batch_batch_send_size_bucket{le="9000",otel_job="otel-collector",processor="batch",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} NaN 1630059097194
# HELP otelcol_processor_batch_batch_send_size_count 
# TYPE otelcol_processor_batch_batch_send_size_count gauge
otelcol_processor_batch_batch_send_size_count{otel_job="otel-collector",processor="batch",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} NaN 1630059097194
# HELP otelcol_processor_batch_batch_send_size_sum 
# TYPE otelcol_processor_batch_batch_send_size_sum gauge
otelcol_processor_batch_batch_send_size_sum{otel_job="otel-collector",processor="batch",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} NaN 1630059097194
# HELP otelcol_processor_batch_batch_size_trigger_send Number of times the batch was sent due to a size trigger
# TYPE otelcol_processor_batch_batch_size_trigger_send counter
otelcol_processor_batch_batch_size_trigger_send{otel_job="otel-collector",processor="batch",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} 5 1630059107194
# HELP otelcol_processor_dropped_metric_points Number of metric points that were dropped.
# TYPE otelcol_processor_dropped_metric_points counter
otelcol_processor_dropped_metric_points{otel_job="otel-collector",processor="memory_limiter",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} 0 1630059107194
# HELP otelcol_processor_refused_metric_points Number of metric points that were rejected by the next component in the pipeline.
# TYPE otelcol_processor_refused_metric_points counter
otelcol_processor_refused_metric_points{otel_job="otel-collector",processor="memory_limiter",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547"} 0 1630059107194
# HELP otelcol_receiver_accepted_metric_points Number of metric points successfully pushed into the pipeline.
# TYPE otelcol_receiver_accepted_metric_points counter
otelcol_receiver_accepted_metric_points{otel_job="otel-collector",receiver="prometheus",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547",transport="http"} 2483 1630059107194
# HELP otelcol_receiver_refused_metric_points Number of metric points that could not be pushed into the pipeline.
# TYPE otelcol_receiver_refused_metric_points counter
otelcol_receiver_refused_metric_points{otel_job="otel-collector",receiver="prometheus",service_instance_id="54fcfc76-64d8-4ae0-a75f-0ecad093d547",transport="http"} 0 1630059107194
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total{otel_job="node-exporter"} 982.8 1630059101854
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds{otel_job="node-exporter"} 65535 1630059101854
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds{otel_job="node-exporter"} 9 1630059101854
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes{otel_job="node-exporter"} 1.6207872e+07 1630059101854
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds{otel_job="node-exporter"} 1.62981915151e+09 1630059101854
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes{otel_job="node-exporter"} 7.36108544e+08 1630059101854
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes{otel_job="node-exporter"} 1.8446744073709552e+19 1630059101854
# HELP promhttp_metric_handler_errors_total Total number of internal errors encountered by the promhttp metric handler.
# TYPE promhttp_metric_handler_errors_total counter
promhttp_metric_handler_errors_total{cause="encoding",otel_job="node-exporter"} 0 1630059101854
promhttp_metric_handler_errors_total{cause="gathering",otel_job="node-exporter"} 0 1630059101854
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight{otel_job="node-exporter"} 1 1630059101854
# HELP promhttp_metric_handler_requests_total 
# TYPE promhttp_metric_handler_requests_total gauge
promhttp_metric_handler_requests_total{code="200",otel_job="node-exporter"} NaN 1630059101854
promhttp_metric_handler_requests_total{code="500",otel_job="node-exporter"} NaN 1630059101854
promhttp_metric_handler_requests_total{code="503",otel_job="node-exporter"} NaN 1630059101854
# HELP scrape_duration_seconds Duration of the scrape
# TYPE scrape_duration_seconds gauge
scrape_duration_seconds{otel_job="node-exporter"} NaN 1630059101854
scrape_duration_seconds{otel_job="otel-collector"} 0.001854044 1630059107194
# HELP scrape_samples_post_metric_relabeling The number of samples remaining after metric relabeling was applied
# TYPE scrape_samples_post_metric_relabeling gauge
scrape_samples_post_metric_relabeling{otel_job="node-exporter"} NaN 1630059101854
scrape_samples_post_metric_relabeling{otel_job="otel-collector"} 40 1630059107194
# HELP scrape_samples_scraped The number of samples the target exposed
# TYPE scrape_samples_scraped gauge
scrape_samples_scraped{otel_job="node-exporter"} NaN 1630059101854
scrape_samples_scraped{otel_job="otel-collector"} 40 1630059107194
# HELP scrape_series_added The approximate number of new series in this scrape
# TYPE scrape_series_added gauge
scrape_series_added{otel_job="node-exporter"} NaN 1630059101854
scrape_series_added{otel_job="otel-collector"} 40 1630059107194
# HELP up The scraping was successful
# TYPE up gauge
up{otel_job="node-exporter"} NaN 1630059101854
up{otel_job="otel-collector"} 1 1630059107194

What version did you use?
Version: >= 0.30.0

What config did you use?

receivers:
#  syslog:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 10s
          static_configs:
            - targets: ['otel-collector:8888']
          relabel_configs:
          # Trick because the otel collector does not expose the job
          - action: replace
            replacement: otel-collector
            target_label: otel_job
        - job_name: node-exporter
          scrape_interval: 10s
          static_configs:
            - targets: ['172.17.0.1:9100']
          relabel_configs:
          # Trick because the otel collector does not expose the job
          - action: replace
            replacement: node-exporter
            target_label: otel_job

exporters:
  prometheus:
    endpoint: "0.0.0.0:9095"
    send_timestamps: true
    metric_expiration: 60s

I don't detail the pipeline section, but there is nothing special about it.
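
For reference, a minimal sketch of what the elided service section presumably looks like, inferred from the workflow described at the top of the issue and from the memory_limiter and batch processors that appear in the collector's own metrics; the exact processor list and order are assumptions, not copied from the real config:

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheus]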

Environment
OS: Linux, docker image

gillg added the bug (Something isn't working) label on Aug 27, 2021
gillg (Contributor, Author) commented Aug 27, 2021

Additional information: I have a lot of errors in the console since 0.30.0, exactly the same as in https://github.com/open-telemetry/opentelemetry-collector/issues/3118#issuecomment-833399201.

It doesn't seem directly related, because that issue predates 0.30.0, but the symptom is similar in a different context (Grafana metrics API > prometheus receiver > prometheus exporter). With 0.29.0 there are no errors in the console for the node exporter.

odeke-em (Member) commented:

@gillg thank you for the report! Given that you have the setup and know what to look for, could you perhaps binary-bisect to find the offending commit? Essentially: start with the commit at the very bottom of the stack of suspect commits; if it doesn't have the regression, jump to the midpoint between bottom and top; from there, if the regression is still absent, move halfway up again, and if it is present, move halfway down. That should help pinpoint the regression.

@gillg
Copy link
Contributor Author

gillg commented Sep 1, 2021

@odeke-em I won't have time to bisect and recompile every version. I try to do my best to help and contribute, but I already have a lot of metrics and logs collection issues to work around for now 😅
But digging deeper, I strongly suspect this commit is the culprit: open-telemetry/opentelemetry-collector@8b79380! I can at least test before and after it. You introduced the NaN value as a stale marker.
If I understand correctly, it writes a NaN value when a metric disappears between two scrape intervals. That makes sense in theory, but if I'm not wrong, the philosophy of a Prometheus exporter (especially the node exporter, which returns tons of metrics) is to answer as quickly as possible, even if some metrics are only present every other scrape.
So maybe the stale state could be delayed or made configurable with a minimum number of scrapes (e.g. mark a metric stale only if it is absent for 2 or 3 consecutive scrape intervals).
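
To illustrate that suggestion, it would amount to something like the following receiver option. This is purely hypothetical: no such setting exists in the prometheus receiver, and the option name is invented here only to make the idea concrete.

  prometheus:
    config:
      scrape_configs: [...]
    # Hypothetical option, not implemented in the receiver:
    # only emit a staleness NaN after the series has been
    # missing for N consecutive scrapes.
    stale_after_missing_scrapes: 3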

I'll keep you posted when I find the time to run some tests.

odeke-em (Member) commented Sep 1, 2021

@odeke-em I won't have time to bisect and recompile every version. I try to do my best to help and contribute, but I already have a lot of metrics and logs collection issues to work around for now 😅

But digging deeper, I strongly suspect this commit is the culprit: open-telemetry/opentelemetry-collector@8b79380! I can at least test before and after it. You introduced the NaN value as a stale marker.

If I understand correctly, it writes a NaN value when a metric disappears between two scrape intervals. That makes sense in theory, but if I'm not wrong, the philosophy of a Prometheus exporter (especially the node exporter, which returns tons of metrics) is to answer as quickly as possible, even if some metrics are only present every other scrape.

So maybe the stale state could be delayed or made configurable with a minimum number of scrapes (e.g. mark a metric stale only if it is absent for 2 or 3 consecutive scrape intervals).

I'll keep you posted when I find the time to run some tests.

Thanks Gill! If you try out the commit just before my change and it gives the desired result, that helps narrow down the cause. As for delaying stale markers, we are trying to emulate what Prometheus does, so I'll run end-to-end tests with Prometheus and compare results. I have a PR to fix staleness markers for replicated/multiple instances, but that isn't what's happening in your issue.

dashpole (Contributor) commented Sep 1, 2021

@gillg I think I might have found a solution, but I haven't tested it against your issue: #5062. Would it be possible for you to try building from my PR and let me know whether it fixes your issue?

gillg (Contributor, Author) commented Sep 2, 2021

@dashpole thanks a lot! I will try to test it today!

alolita added the comp:prometheus (Prometheus related issues) label on Sep 2, 2021
gillg (Contributor, Author) commented Sep 3, 2021

@dashpole Tested and approved!
It's definitely better: no more exception traces in the logs and no more NaN values!

odeke-em (Member) commented Sep 5, 2021

@alolita @Aneurysm9 @gillg please help close this issue.
