Cherry-pick #17753 to 7.x: Fix prometheus histogram rate overflows #17783

Merged
jsoriano merged 2 commits into elastic:7.x from backport_17753_7.x on Apr 20, 2020

Conversation

jsoriano
Member

Cherry-pick of PR #17753 to 7.x branch. Original message:

What does this PR do?

Fix some overflows in Prometheus histogram rate calculations. They could be caused by:

  • New buckets being added to existing histograms at runtime; this happens at least with CockroachDB (see Use Elasticsearch types in Cockroachdb module #17736).
  • Buckets with bigger upper limits having lower counters than buckets with smaller limits. This is wrong and I have only reproduced it in tests, but it is handled just in case, to avoid losing other data if this happens with some service.

Rate calculation methods now also return a boolean, to be able to differentiate whether a zero value is caused by it being the first call or by the rate actually being zero.
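A minimal sketch of this behavior, using hypothetical names (counterCache, RateUint64) rather than the actual Beats code, which may differ in detail:

package main

import "fmt"

// counterCache keeps the last value seen for each counter, so rates can be
// computed as differences between consecutive scrapes.
type counterCache struct {
	last map[string]uint64
}

func newCounterCache() *counterCache {
	return &counterCache{last: map[string]uint64{}}
}

// RateUint64 returns the increase of a counter since the previous call.
// The boolean is false when there was no previous value, so a zero rate on
// the first call can be told apart from a rate that is actually zero.
// If the counter went backwards (for example because the bucket layout
// changed at runtime), the rate is reported as zero instead of letting the
// unsigned subtraction wrap around to a huge number.
func (c *counterCache) RateUint64(name string, current uint64) (uint64, bool) {
	previous, found := c.last[name]
	c.last[name] = current
	if !found {
		return 0, false
	}
	if current < previous {
		return 0, true
	}
	return current - previous, true
}

func main() {
	cache := newCounterCache()
	fmt.Println(cache.RateUint64("le_100", 10)) // 0 false: first call
	fmt.Println(cache.RateUint64("le_100", 25)) // 15 true: normal increase
	fmt.Println(cache.RateUint64("le_100", 5))  // 0 true: counter went backwards
}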

Why is it important?

These overflows produce very big numbers; documents containing these values are rejected by Elasticsearch, causing the loss of these and of other metrics in the same documents.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

  • Configure Metricbeat to monitor a service that exposes Prometheus metrics, where some of these metrics are histograms whose bucket distribution can change over time. I have only seen this with CockroachDB; the issue could be reproduced with a configuration like the following one:
metricbeat.modules:
  - module: prometheus
    metricsets: ['collector']
    hosts: ['172.21.0.2:8080']
    metrics_path: '/_status/vars'
    use_types: true

output.elasticsearch.hosts: ["http://localhost:9200"]
  • Check that metrics are ingested and no errors appear in Metricbeat logs.

Related issues

Logs

Example error:

2020-04-16T14:13:24.057+0200	WARN	[elasticsearch]	elasticsearch/client.go:385	Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Time{wall:0xbf9e30d8bb1146e2, ext:18152263511, loc:(*time.Location)(0x7b0cda0)}, Meta:null, Fields:{"agent":{"ephemeral_id":"f625d601-5fe0-4fd4-834c-a093e836c8ef","hostname":"voyager","id":"9aed890c-78d3-4460-bd44-54c141fc352d","type":"metricbeat","version":"8.0.0"},"ecs":{"version":"1.5.0"},"event":{"dataset":"prometheus.collector","duration":9085986,"module":"prometheus"},"host":{"name":"voyager"},"metricset":{"name":"collector","period":10000},"prometheus":{"addsstable_applications":{"counter":0.000000},"addsstable_copies":{"counter":0.000000},"addsstable_proposals":{"counter":0.000000},"capacity":{"value":107374182400.000000},"capacity_available":{"value":26611531776.000000},"capacity_reserved":{"value":0.000000},"capacity_used":{"value":679969.000000},"compactor_compactingnanos":{"counter":0.000000},"compactor_compactions_failure":{"counter":0.000000},"compactor_compactions_success":{"counter":0.000000},"compactor_suggestionbytes_compacted":{"counter":0.000000},"compactor_suggestionbytes_queued":{"value":0.000000},"compactor_suggestionbytes_skipped":{"counter":0.000000},"follower_reads_success_count":{"counter":0.000000},"gcbytesage":{"value":16466.000000},"intentage":{"value":0.000000},"intentbytes":{"value":0.000000},"intentcount":{"value":0.000000},"intentresolver_async_throttled":{"counter":0.000000},"intents_abort_attempts":{"counter":0.000000},"intents_poison_attempts":{"counter":0.000000},"intents_resolve_attempts":{"counter":30.000000},"keybytes":{"value":33026.000000},"keycount":{"value":725.000000},"kv_closed_timestamp_max_behind_nanos":{"value":36006495619.000000},"kv_rangefeed_catchup_scan_nanos":{"counter":0.000000},"labels":{"instance":"172.22.0.2:8080","job":"prometheus","store":"1"},"lastupdatenanos":{"value":1587039199844044288.000000},"leases_epoch":{"value":18.000000},"leases_error":{"counter":0.000000},"leases_expiration":{"value":2.000000},"leases_success":{"counter":28.000000},"leases_transfers_error":{"counter":0.000000},"leases_transfers_success":{"counter":0.000000},"livebytes":{"value":129018.000000},"livecount":{"value":724.000000},"queue_consistency_pending":{"value":0.000000},"queue_consistency_process_failure":{"counter":0.000000},"queue_consistency_process_success":{"counter":20.000000},"queue_consistency_processingnanos":{"counter":3259882231.000000},"queue_gc_info_abortspanconsidered":{"counter":0.000000},"queue_gc_info_abortspangcnum":{"counter":0.000000},"queue_gc_info_abortspanscanned":{"counter":0.000000},"queue_gc_info_intentsconsidered":{"counter":0.000000},"queue_gc_info_intenttxns":{"counter":0.000000},"queue_gc_info_numkeysaffected":{"counter":0.000000},"queue_gc_info_pushtxn":{"counter":0.000000},"queue_gc_info_resolvesuccess":{"counter":0.000000},"queue_gc_info_resolvetotal":{"counter":0.000000},"queue_gc_info_transactionspangcaborted":{"counter":0.000000},"queue_gc_info_transactionspangccommitted":{"counter":0.000000},"queue_gc_info_transactionspangcpending":{"counter":0.000000},"queue_gc_info_transactionspanscanned":{"counter":0.000000},"queue_gc_pending":{"value":0.000000},"queue_gc_process_failure":{"counter":0.000000},"queue_gc_process_success":{"counter":0.000000},"queue_gc_processingnanos":{"counter":0.000000},"queue_merge_pending":{"value":0.000000},"queue_merge_process_failure":{"counter":0.000000},"queue_merge_process_success":{"counter":0.000000},"queue_merge_processingnanos":{"counter":
0.000000},"queue_merge_purgatory":{"value":0.000000},"queue_raftlog_pending":{"value":0.000000},"queue_raftlog_process_failure":{"counter":0.000000},"queue_raftlog_process_success":{"counter":23.000000},"queue_raftlog_processingnanos":{"counter":360904963.000000},"queue_raftsnapshot_pending":{"value":0.000000},"queue_raftsnapshot_process_failure":{"counter":0.000000},"queue_raftsnapshot_process_success":{"counter":0.000000},"queue_raftsnapshot_processingnanos":{"counter":0.000000},"queue_replicagc_pending":{"value":0.000000},"queue_replicagc_process_failure":{"counter":0.000000},"queue_replicagc_process_success":{"counter":0.000000},"queue_replicagc_processingnanos":{"counter":0.000000},"queue_replicagc_removereplica":{"counter":0.000000},"queue_replicate_addreplica":{"counter":0.000000},"queue_replicate_pending":{"value":0.000000},"queue_replicate_process_failure":{"counter":21.000000},"queue_replicate_process_success":{"counter":0.000000},"queue_replicate_processingnanos":{"counter":923795838.000000},"queue_replicate_purgatory":{"value":20.000000},"queue_replicate_rebalancereplica":{"counter":0.000000},"queue_replicate_removedeadreplica":{"counter":0.000000},"queue_replicate_removereplica":{"counter":0.000000},"queue_replicate_transferlease":{"counter":0.000000},"queue_split_pending":{"value":0.000000},"queue_split_process_failure":{"counter":0.000000},"queue_split_process_success":{"counter":0.000000},"queue_split_processingnanos":{"counter":0.000000},"queue_split_purgatory":{"value":0.000000},"queue_tsmaintenance_pending":{"value":0.000000},"queue_tsmaintenance_process_failure":{"counter":0.000000},"queue_tsmaintenance_process_success":{"counter":3.000000},"queue_tsmaintenance_processingnanos":{"counter":619011410.000000},"raft_commandsapplied":{"counter":186.000000},"raft_enqueued_pending":{"value":0.000000},"raft_entrycache_accesses":{"counter":44.000000},"raft_entrycache_bytes":{"value":172159.000000},"raft_entrycache_hits":{"counter":4.000000},"raft_entrycache_size":{"value":173.000000},"raft_heartbeats_pending":{"value":0.000000},"raft_process_applycommitted_latency":{"histogram":{"counts":[0,0,0,0,0,0,1,0,3,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9223372036854775801,8,0,0],"values":[40959.5,86015,100351,112639,116735,124927,135167,143359,151551,159743,167935,176127,184319,192511,200703,208895,217087,225279,233471,241663,249855,258047,270335,286719,303103,319487,335871,352255,368639,385023,401407,417791,434175,450559,466943,483327,507903,540671,573439,606207,638975,688127,753663,819199,884735,1015807,1146879,1343487,1605631,2555903,3538943,3932159]}},"raft_process_commandcommit_latency":{"histogram":{"counts":[0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,9223372036854775806,3,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,9223372036854775801,8,0,0],"values":[3199.5,6527,6911,7423,7935,8447,8959,9727,10495,11263,12031,12543,13055,13823,14847,15615,16639,17919,18943,19967,20991,22015,23039,24063,25087,26111,27135,28159,29183,30207,31231,32255,33791,35839,37887,39935,41983,44031,46079,48127,50175,52223,54271,56319,58367,60415,62463,64511,67583,71679,75775,79871,83967,88063,92159,96255,100351,106495,129023,303103,688127,1277951,1769471,2162687]}},"raft_process_handleready_latency":{"histogram":{"counts":[0,2,0,0,9223372036854775806,0,4,9223372036854775804,0,0,0,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"values":[491519.5,999423,1032191,1179647,1409023,1572863,1835007,5996543,10747903,12
058623,14155775,20971519,26738687,27787263,28835839,29884415,30932991,31981567,33030143,34603007,36700159,38797311,40894463,42991615,45088767,47185919,49283071,51380223,53477375,55574527,57671679,59768831,61865983,63963135,66060287,69206015,73400319,77594623,81788927,88080383,94371839,98566143,102760447,109051903,115343359,130023423,146800639,159383551,260046847,360710143,385875967]}},"raft_process_logcommit_latency":{"histogram":{"counts":[0,2,0,0,9223372036854775806,0,4,9223372036854775804,0,0,0,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"values":[393215.5,819199,868351,933887,1114111,1343487,1540095,4882431,9830399,11796479,13631487,20185087,25690111,26738687,27787263,28835839,29884415,30932991,31981567,33030143,34603007,36700159,38797311,40894463,42991615,45088767,47185919,49283071,51380223,53477375,55574527,57671679,59768831,61865983,63963135,66060287,69206015,73400319,77594623,81788927,88080383,94371839,98566143,102760447,109051903,127926271,146800639,159383551,251658239,343932927,360710143,385875967]}},"raft_process_tickingnanos":{"counter":1858744.000000},"raft_process_workingnanos":{"counter":8547434611.000000},"raft_rcvd_app":{"counter":0.000000},"raft_rcvd_appresp":{"counter":0.000000},"raft_rcvd_dropped":{"counter":0.000000},"raft_rcvd_heartbeat":{"counter":0.000000},"raft_rcvd_heartbeatresp":{"counter":0.000000},"raft_rcvd_prevote":{"counter":0.000000},"raft_rcvd_prevoteresp":{"counter":0.000000},"raft_rcvd_prop":{"counter":0.000000},"raft_rcvd_snap":{"counter":0.000000},"raft_rcvd_timeoutnow":{"counter":0.000000},"raft_rcvd_transferleader":{"counter":0.000000},"raft_rcvd_vote":{"counter":0.000000},"raft_rcvd_voteresp":{"counter":0.000000},"raft_ticks":{"counter":150.000000},"raftlog_behind":{"value":0.000000},"raftlog_truncated":{"counter":13.000000},"range_adds":{"counter":0.000000},"range_merges":{"counter":0.000000},"range_raftleadertransfers":{"counter":0.000000},"range_removes":{"counter":0.000000},"range_snapshots_generated":{"counter":0.000000},"range_snapshots_normal_applied":{"counter":0.000000},"range_snapshots_preemptive_applied":{"counter":0.000000},"range_splits":{"counter":0.000000},"ranges":{"value":20.000000},"ranges_overreplicated":{"value":0.000000},"ranges_unavailable":{"value":0.000000},"ranges_underreplicated":{"value":20.000000},"rebalancing_lease_transfers":{"counter":0.000000},"rebalancing_queriespersecond":{"value":9.125939},"rebalancing_range_rebalances":{"counter":0.000000},"rebalancing_writespersecond":{"value":69.916395},"replicas":{"value":20.000000},"replicas_leaders":{"value":20.000000},"replicas_leaders_not_leaseholders":{"value":0.000000},"replicas_leaseholders":{"value":20.000000},"replicas_quiescent":{"value":20.000000},"replicas_reserved":{"value":0.000000},"requests_backpressure_split":{"value":0.000000},"requests_slow_latch":{"value":0.000000},"requests_slow_lease":{"value":0.000000},"requests_slow_raft":{"value":0.000000},"rocksdb_block_cache_hits":{"value":0.000000},"rocksdb_block_cache_misses":{"value":0.000000},"rocksdb_block_cache_pinned_usage":{"value":0.000000},"rocksdb_block_cache_usage":{"value":0.000000},"rocksdb_bloom_filter_prefix_checked":{"value":0.000000},"rocksdb_bloom_filter_prefix_useful":{"value":0.000000},"rocksdb_compactions":{"value":0.000000},"rocksdb_encryption_algorithm":{"value":0.000000},"rocksdb_flushes":{"value":0.000000},"rocksdb_memtable_total_size":{"value":724080.000000},"rocksdb_num_sstables":{"value":0.000000},"rocksdb_read_amplification":{"value":0.000000},"rocksd
b_table_readers_mem_estimate":{"value":0.000000},"sysbytes":{"value":4630.000000},"syscount":{"value":103.000000},"totalbytes":{"value":129817.000000},"tscache_skl_read_pages":{"value":1.000000},"tscache_skl_read_rotations":{"counter":0.000000},"tscache_skl_write_pages":{"value":1.000000},"tscache_skl_write_rotations":{"counter":0.000000},"txnwaitqueue_deadlocks_total":{"counter":0.000000},"txnwaitqueue_pushee_waiting":{"value":0.000000},"txnwaitqueue_pusher_slow":{"value":0.000000},"txnwaitqueue_pusher_wait_time":{"histogram":{"counts":[0],"values":[0]}},"txnwaitqueue_pusher_waiting":{"value":0.000000},"txnwaitqueue_query_wait_time":{"histogram":{"counts":[0],"values":[0]}},"txnwaitqueue_query_waiting":{"value":0.000000},"valbytes":{"value":96791.000000},"valcount":{"value":736.000000}},"service":{"address":"172.22.0.2:8080","type":"prometheus"}}, Private:interface {}(nil), TimeSeries:true}, Flags:0x0, Cache:publisher.EventCache{m:common.MapStr(nil)}} (status=400): {"type":"mapper_parsing_exception","reason":"failed to parse field [prometheus.raft_process_applycommitted_latency.histogram] of type [histogram]","caused_by":{"type":"input_coercion_exception","reason":"Numeric value (9223372036854775801) out of range of int (-2147483648 - 2147483647)\n at [Source: (org.elasticsearch.common.bytes.AbstractBytesReference$MarkSupportingStreamInputWrapper); line: 1, column: 855]"}}
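Bucket counts like 9223372036854775801 in the rejected document are typical of such an overflow: when the cumulative count of a bucket is lower than the previously cached one, the unsigned subtraction wraps around to a value close to the maximum of the integer type, and Elasticsearch rejects the whole document. A minimal illustration of the wraparound (not the actual Beats code):

package main

import "fmt"

func main() {
	// Cumulative counts of the same bucket in two consecutive scrapes;
	// the newer one is smaller, e.g. because the bucket layout changed.
	var previous uint64 = 12
	var current uint64 = 5

	// A naive rate calculation wraps around to a huge bogus value.
	fmt.Println(current - previous) // 18446744073709551609
}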

Fix some overflows in Prometheus histogram rate calculations.
They could be caused by:
* New buckets added to existing histograms at runtime, this happens at
  least with CockroachDB (see elastic#17736).
* Buckets with bigger upper limits have lower counters. This is wrong and
  has only been reproduced in tests, but it is handled just in case to avoid
  losing other data if this happens with some service.

Rate calculation methods now also return a boolean to be able to
differentiate whether a zero value is caused by it being the first call, or
by the rate actually being zero.

(cherry picked from commit 0afffa8)
jsoriano self-assigned this Apr 17, 2020
jsoriano merged commit 7bb588f into elastic:7.x Apr 20, 2020
jsoriano deleted the backport_17753_7.x branch April 20, 2020 10:42