opentelemetry: JSON Metrics and Exp Histograms #11456
Conversation
…ogs/traces
- Added shared content-type/payload helpers in plugins/in_opentelemetry/opentelemetry_utils.{h,c}:
- opentelemetry_is_json_content_type
- opentelemetry_is_protobuf_content_type
- opentelemetry_is_grpc_content_type
- opentelemetry_payload_starts_with_json_object
- Switched logs/traces/metrics paths to use shared helpers:
- plugins/in_opentelemetry/opentelemetry_logs.c
- plugins/in_opentelemetry/opentelemetry_traces.c
- plugins/in_opentelemetry/opentelemetry_prot.c
- Removed duplicated local gRPC content-type checker from opentelemetry_prot.c.
- Improved coherence across signals:
- unified JSON/protobuf/gRPC detection
- accepts content-type params (e.g. application/json; charset=utf-8)
- consistent JSON payload prefix validation for logs/traces (sketched below).
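For illustration, a minimal sketch of the kind of prefix check `opentelemetry_payload_starts_with_json_object` performs — skip leading whitespace and look for an opening brace. This is a standalone approximation, not the actual plugin code:

```c
#include <stddef.h>

/* Illustrative sketch: return 1 if the payload, after optional leading
 * whitespace, begins with a JSON object ('{'), 0 otherwise. */
static int payload_starts_with_json_object(const char *payload, size_t size)
{
    size_t i;

    if (payload == NULL) {
        return 0;
    }

    for (i = 0; i < size; i++) {
        /* skip leading whitespace before the first meaningful byte */
        if (payload[i] == ' ' || payload[i] == '\t' ||
            payload[i] == '\r' || payload[i] == '\n') {
            continue;
        }
        return payload[i] == '{';
    }

    return 0;
}
```

A check like this complements the content-type predicates: when the header is ambiguous or missing, the body can still be classified by its first meaningful byte.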
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 088238d202
📝 Walkthrough
Adds Exponential Histogram support across cmetrics (new type, storage, encode/decode, merge/filter), typed metric value APIs, OTLP JSON→CMT and msgpack→CFL conversion utilities, centralized OpenTelemetry content-type predicates, many tests and test data, and a cmetrics version bump.
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed (2 warnings)
No actionable comments were generated in the recent review. 🎉
…ence Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
lib/cmetrics/src/cmt_encode_splunk_hec.c (1)
20-28: ⚠️ Potential issue | 🟠 Major — Fix exp histogram count conversion before formatting.
`metric->hist_count` stores a double bit-pattern. Assigning `metric->exp_hist_count` directly means `cmt_metric_hist_get_count_value()` will decode garbage (it treats the bits as a double). Convert the count to the expected double bit representation before formatting.
🐛 Proposed fix (convert count to double-bit representation)
```diff
@@
-#include <cmetrics/cmt_encode_splunk_hec.h>
+#include <cmetrics/cmt_encode_splunk_hec.h>
+#include <cmetrics/cmt_math.h>
@@
-    metric->hist_count = metric->exp_hist_count;
+    metric->hist_count = cmt_math_d64_to_uint64((double) metric->exp_hist_count);
```
Also applies to: 569-605
lib/cmetrics/src/cmt_encode_opentelemetry.c (1)
38-57: ⚠️ Potential issue | 🟠 Major — Missing type check on variant before accessing `as_kvlist`.
The `fetch_metadata_kvlist_key` in `cmt_encode_text.c` (line 60) validates `entry_variant->type != CFL_VARIANT_KVLIST` before accessing the kvlist data. This version skips that check, so a non-kvlist variant will be interpreted as a kvlist pointer — likely causing a crash or silent corruption.
Proposed fix
```diff
     entry_variant = cfl_kvlist_fetch(kvlist, key);
-    if (entry_variant != NULL) {
+    if (entry_variant != NULL && entry_variant->type == CFL_VARIANT_KVLIST) {
         entry_kvlist = entry_variant->data.as_kvlist;
     }
     else {
         entry_kvlist = NULL;
     }
```
🤖 Fix all issues with AI agents
In `@lib/cmetrics/src/cmt_encode_opentelemetry.c`:
- Around line 410-430: The loop populates exemplars[] with possible NULL holes
because skipped entries use continue but *out_count is set to
array->entry_count; change the loop in the function that builds exemplars (where
create_exemplar_from_kvlist is called) to use a separate write index (e.g.,
dest_idx) so you only store created exemplars compactly, increment dest_idx on
success, and after the loop set *out_exemplars = exemplars and *out_count =
dest_idx; also ensure the existing failure cleanup still frees any allocated
exemplars up to dest_idx and frees exemplars on error (a sketch of this compact
write-index pattern follows this file's items).
- Around line 521-524: The block that resets number_data_point->value_case and
overwrites number_data_point->as_double should be removed so we don't clobber
the value previously set (particularly the UINT64 > INT64_MAX path which
deliberately set as_double from cmt_metric_get_uint64_value); locate the check
using number_data_point->value_case ==
OPENTELEMETRY__PROTO__METRICS__V1__NUMBER_DATA_POINT__VALUE_AS_DOUBLE and delete
the statements that reassign value_case and number_data_point->as_double (or
otherwise guard them so they don't overwrite an existing as_double set earlier
in the same function such as in the code around cmt_metric_get_uint64_value and
cmt_metric_get_value).
- Around line 3356-3412: The current loop over context->scope_metrics_list
creates identical metric objects for every scope (duplicating metrics across
scopes); instead create each metric once and attach it only to its originating
scope. Modify the code around the scope_index loop in pack_basic_type (the block
that calls initialize_metric, apply_metric_metadata_from_otlp_context,
append_sample_to_metric, and append_metric_to_scope_metrics) so that you
determine the correct target scope (from map or sample metadata) and either: (a)
build the metric once outside the scope loop and call
append_metric_to_scope_metrics only for the matching scope, or (b) guard the
per-scope creation with a check that the current scope_index equals the metric’s
source scope before creating/adding the metric; use initialize_metric,
apply_metric_metadata_from_otlp_context, append_sample_to_metric, and
append_metric_to_scope_metrics to locate and update the exact spots to implement
this filter.
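For the first item above (the exemplar array with NULL holes), a minimal sketch of the compact write-index pattern. The element type and creator function here are hypothetical placeholders, not the actual cmetrics structures:

```c
#include <stdlib.h>

/* Hypothetical exemplar type and creator; placeholders for the real
 * structures used in cmt_encode_opentelemetry.c. */
struct demo_exemplar { double value; };

static struct demo_exemplar *create_demo_exemplar(double source)
{
    struct demo_exemplar *e;

    if (source < 0) {                 /* entries we decide to skip */
        return NULL;
    }

    e = calloc(1, sizeof(*e));
    if (e == NULL) {
        return NULL;
    }
    e->value = source;
    return e;
}

/* Compact write-index pattern: skipped entries leave no NULL holes and
 * *out_count reflects the number of entries actually stored. */
static int build_exemplars(const double *sources, size_t source_count,
                           struct demo_exemplar ***out_exemplars,
                           size_t *out_count)
{
    struct demo_exemplar **exemplars;
    struct demo_exemplar *e;
    size_t dest_idx = 0;
    size_t i;

    exemplars = calloc(source_count > 0 ? source_count : 1, sizeof(*exemplars));
    if (exemplars == NULL) {
        return -1;
    }

    for (i = 0; i < source_count; i++) {
        e = create_demo_exemplar(sources[i]);
        if (e == NULL) {
            continue;                 /* skip: dest_idx is not advanced */
        }
        exemplars[dest_idx++] = e;
    }

    *out_exemplars = exemplars;
    *out_count = dest_idx;
    return 0;
}
```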
In `@lib/cmetrics/src/cmt_exp_histogram.c`:
- Around line 125-155: The current logic frees metric->exp_hist_positive_buckets
and metric->exp_hist_negative_buckets early and may leave metric with new
positive buckets but no negative buckets on allocation failure; change to
allocate temporary buffers first (e.g., temp_pos =
calloc(positive_bucket_count,...), temp_neg =
calloc(negative_bucket_count,...)), check both allocations succeed, memcpy into
temporaries, then free the old
metric->exp_hist_positive_buckets/exp_hist_negative_buckets and assign
metric->exp_hist_positive_buckets = temp_pos and
metric->exp_hist_negative_buckets = temp_neg and update
metric->exp_hist_positive_count and metric->exp_hist_negative_count (and any
related scalar fields) only after the swap; ensure you free any allocated
temporaries on error before returning -1 so the metric remains consistent.
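A minimal sketch of the allocate-then-swap approach described above, using a simplified stand-in struct instead of the real `struct cmt_metric`:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the exponential-histogram bucket fields
 * of struct cmt_metric. */
struct demo_exp_hist {
    uint64_t *positive_buckets;
    size_t    positive_count;
    uint64_t *negative_buckets;
    size_t    negative_count;
};

/* Allocate temporaries first and swap them in only when both succeed,
 * so an allocation failure leaves the metric untouched. */
static int demo_set_buckets(struct demo_exp_hist *m,
                            const uint64_t *pos, size_t pos_count,
                            const uint64_t *neg, size_t neg_count)
{
    uint64_t *tmp_pos;
    uint64_t *tmp_neg;

    tmp_pos = calloc(pos_count > 0 ? pos_count : 1, sizeof(uint64_t));
    tmp_neg = calloc(neg_count > 0 ? neg_count : 1, sizeof(uint64_t));

    if (tmp_pos == NULL || tmp_neg == NULL) {
        free(tmp_pos);                  /* free(NULL) is a no-op */
        free(tmp_neg);
        return -1;
    }

    if (pos_count > 0) {
        memcpy(tmp_pos, pos, pos_count * sizeof(uint64_t));
    }
    if (neg_count > 0) {
        memcpy(tmp_neg, neg, neg_count * sizeof(uint64_t));
    }

    /* both allocations succeeded: now it is safe to drop the old buffers */
    free(m->positive_buckets);
    free(m->negative_buckets);

    m->positive_buckets = tmp_pos;
    m->positive_count   = pos_count;
    m->negative_buckets = tmp_neg;
    m->negative_count   = neg_count;

    return 0;
}
```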
In `@lib/cmetrics/src/cmt_metric.c`:
- Around line 60-106: The value_type field is currently stored before the typed
value fields which can make readers (via cmt_metric_get_value_snapshot) see the
new type with stale values; for each setter (cmt_metric_set_double,
cmt_metric_set_int64, cmt_metric_set_uint64 and any add() implementation),
reorder the atomic stores so you store val/val_int64/val_uint64/timestamp first
and perform cmt_atomic_store(&metric->value_type, CMT_METRIC_VALUE_...) as the
last store to ensure snapshots are consistent and type acts as the stable
version marker.
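A minimal sketch of the suggested store ordering, written with C11 atomics rather than the project's `cmt_atomic_store` wrappers; the struct and field names are illustrative:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

enum { VALUE_DOUBLE = 0, VALUE_INT64 = 1, VALUE_UINT64 = 2 };

struct metric_values {
    _Atomic uint64_t val;        /* double stored as its bit pattern */
    _Atomic uint64_t val_int64;  /* int64 stored in a uint64 container */
    _Atomic uint64_t val_uint64;
    _Atomic uint64_t timestamp;
    _Atomic uint64_t value_type; /* written last: acts as the version marker */
};

static void metric_set_int64(struct metric_values *m, uint64_t ts, int64_t value)
{
    uint64_t bits;

    memcpy(&bits, &value, sizeof(bits));

    /* store the payload fields first ... */
    atomic_store(&m->val_int64, bits);
    atomic_store(&m->timestamp, ts);

    /* ... and publish the type last, so a reader that observes the new
     * type is guaranteed to also observe the new value. */
    atomic_store(&m->value_type, VALUE_INT64);
}
```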
In `@lib/cmetrics/tests/exp_histogram.c`:
- Around line 373-477: The variables encoded_cloudwatch_emf,
encoded_remote_write, and encoded_opentelemetry are not initialized and their
corresponding destroy functions (cmt_encode_cloudwatch_emf_destroy,
cmt_encode_prometheus_remote_write_destroy, cmt_encode_opentelemetry_destroy)
are called unconditionally; initialize these pointers to NULL before calling
cmt_encode_cloudwatch_emf_create, cmt_encode_prometheus_remote_write_create, and
cmt_encode_opentelemetry_create, and only call their destroy functions if the
pointer is non-NULL (i.e., guard each destroy with a NULL-check) to avoid
freeing uninitialized/invalid memory.
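A minimal sketch of the initialize-to-NULL / guarded-destroy pattern; the create/destroy pair below is a placeholder for the cmetrics encoder functions named above:

```c
#include <stdlib.h>

/* Placeholder encode/destroy pair standing in for e.g.
 * cmt_encode_opentelemetry_create()/_destroy(). */
static char *encoder_create(int should_fail)
{
    return should_fail ? NULL : calloc(1, 16);
}

static void encoder_destroy(char *buf)
{
    free(buf);
}

static int run_test_case(void)
{
    /* initialize to NULL so a failed create can be detected safely */
    char *encoded = NULL;
    int ok = 0;

    encoded = encoder_create(0);
    if (encoded != NULL) {
        /* ... assertions on the encoded payload would go here ... */
        ok = 1;
    }

    /* only destroy what was actually created */
    if (encoded != NULL) {
        encoder_destroy(encoded);
    }

    return ok;
}
```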
In `@plugins/in_opentelemetry/opentelemetry_utils.c`:
- Around line 157-228: The code reads content_type[expected_length] and
content_type[16] without ensuring the string is long enough; update
opentelemetry_content_type_matches to compute size_t content_len =
strlen(content_type) and return FLB_FALSE if content_len < expected_length
before accessing content_type[expected_length], and update
opentelemetry_is_grpc_content_type to check content_len = strlen(content_type)
and return FLB_FALSE if content_len < 16 before accessing content_type[16]; keep
existing strncasecmp checks but perform the length checks first to avoid
out-of-bounds reads.
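A minimal sketch of the length-checked comparison; it works on plain C strings and approximates what `opentelemetry_content_type_matches` could do, not the actual implementation:

```c
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Return 1 if content_type matches "expected" exactly or is followed by
 * parameters (e.g. "application/json; charset=utf-8"), checking the string
 * length before indexing past the prefix. */
static int content_type_matches(const char *content_type, const char *expected)
{
    size_t expected_len;
    size_t content_len;

    if (content_type == NULL || expected == NULL) {
        return 0;
    }

    expected_len = strlen(expected);
    content_len  = strlen(content_type);

    if (content_len < expected_len) {
        return 0;                              /* too short: no OOB read */
    }

    if (strncasecmp(content_type, expected, expected_len) != 0) {
        return 0;
    }

    /* exact match, or prefix followed by a parameter separator */
    return content_type[expected_len] == '\0' ||
           content_type[expected_len] == ';'  ||
           content_type[expected_len] == ' ';
}
```

For example, `content_type_matches("application/json; charset=utf-8", "application/json")` returns 1, while a string shorter than the expected prefix is rejected before any indexing happens.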
In `@src/opentelemetry/flb_opentelemetry_utils.c`:
- Around line 624-647: The function parse_u64_from_msgpack_object currently
treats non-numeric strings and negative values as success (producing 0), so
update parse_u64_from_msgpack_object to explicitly reject negative integers and
invalid string conversions: for MSGPACK_OBJECT_NEGATIVE_INTEGER immediately
return error when object->via.i64 < 0; for MSGPACK_OBJECT_STR call
flb_otel_utils_convert_string_number_to_u64 but validate the result (ensure the
input string is numeric, not prefixed with '-', and the conversion indicates
success) and return error on failure instead of unconditionally writing 0; apply
the same defensive validation logic to the analogous conversion block in the
other function that uses flb_otel_utils_convert_string_number_to_u64.
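A minimal sketch of the defensive string-to-u64 validation (standing in for `flb_otel_utils_convert_string_number_to_u64` plus the checks described above); the msgpack object dispatch itself is omitted, and the negative-integer case would simply return an error:

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Convert a decimal string to uint64_t, rejecting empty input, signs,
 * non-numeric characters and out-of-range values. Returns 0 on success. */
static int parse_u64_strict(const char *str, size_t len, uint64_t *out)
{
    char buf[32];
    char *end = NULL;
    unsigned long long value;

    if (str == NULL || len == 0 || len >= sizeof(buf)) {
        return -1;
    }
    if (!isdigit((unsigned char) str[0])) {
        return -1;                     /* rejects '-', '+', whitespace, etc. */
    }

    memcpy(buf, str, len);
    buf[len] = '\0';

    errno = 0;
    value = strtoull(buf, &end, 10);
    if (errno != 0 || end == buf || *end != '\0') {
        return -1;                     /* overflow or trailing garbage */
    }

    *out = (uint64_t) value;
    return 0;
}
```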
🧹 Nitpick comments (7)
tests/internal/data/opentelemetry/metrics.json (1)
864-910: Misleading `invalid_` prefix on test cases that expect successful decoding.
`invalid_exponential_histogram_count_mismatch` (line 864) and `invalid_exponential_histogram_inconsistent_layout` (line 911) are prefixed with `invalid_` but both have `"expected"` blocks (i.e., they decode successfully). This could confuse future maintainers into thinking these should produce errors. Consider renaming to a prefix like `valid_exponential_histogram_count_mismatch` or `quirky_`/`lenient_` to indicate they are accepted despite being semantically odd.
lib/cmetrics/include/cmetrics/cmt_variant_utils.h (1)
513-532: Pre-existing asymmetry: `unpack_cfl_variant_uint64` creates an INT64 variant from unsigned data.
Not introduced by this PR, but now more relevant since `pack_cfl_variant` correctly handles `CFL_VARIANT_UINT` → `mpack_write_u64`. The unpack path converts `mpack_type_uint` back to `CFL_VARIANT_INT` (via `cfl_variant_create_from_int64` with a cast), so a round-trip of a `CFL_VARIANT_UINT` value > `INT64_MAX` will silently corrupt data. Worth addressing in a follow-up.
lib/cmetrics/include/cmetrics/cmt_metric.h (1)
34-37: `val_int64` declared as `uint64_t` is intentional but warrants a brief comment.
All three fields (`val`, `val_int64`, `val_uint64`) use `uint64_t` for uniform atomic operations, but `val_int64` storing an int64 bit pattern in a uint64 container is non-obvious. A one-line comment would help future readers.
📝 Suggested comment
```diff
     /* counters and gauges */
     uint64_t val;
     uint64_t value_type;
-    uint64_t val_int64;
+    uint64_t val_int64; /* int64 value stored as uint64 for atomic ops */
     uint64_t val_uint64;
```
lib/cmetrics/src/cmt_encode_prometheus_remote_write.c (1)
892-909: Conversion failure mapped to `ALLOCATION_ERROR` regardless of actual cause.
`cmt_exp_histogram_to_explicit` can fail for non-allocation reasons (e.g., non-finite base value, `base <= 1.0`). Mapping all failures to `CMT_ENCODE_PROMETHEUS_REMOTE_WRITE_ALLOCATION_ERROR` obscures the real cause. Consider introducing a more general error code or at least adding a comment.
lib/cmetrics/src/cmt_encode_prometheus.c (1)
470-520: Consider extracting a helper for exp-histogram → explicit conversion.
The conversion logic is duplicated for static and dynamic metric paths; factoring it into a small helper would reduce drift risk and simplify future changes.
Also applies to: 546-592
lib/cmetrics/src/cmt_decode_opentelemetry.c (1)
1314-1547: Consider extracting a helper for repeated metric-metadata insertion.
The `metric->n_metadata` cloning block is repeated for each metric type; a small helper would reduce duplication and keep behavior consistent.
lib/cmetrics/src/cmt_decode_msgpack.c (1)
820-832: Missing bounds check in exp histogram per-bucket callbacks.
`unpack_exp_histogram_positive_bucket` and `unpack_exp_histogram_negative_bucket` index directly into the allocated bucket arrays without validating `index < exp_hist_positive_count`/`exp_hist_negative_count`. While the bounds are implicitly safe (both the allocator and the iterator use the same peeked array length), the regular histogram bucket callback at line 699–702 performs explicit bounds checking for defense-in-depth. Adding similar guards here would be consistent and protect against future refactoring.
Proposed fix
```diff
 static int unpack_exp_histogram_positive_bucket(mpack_reader_t *reader, size_t index, void *context)
 {
     struct cmt_msgpack_decode_context *decode_context;

     decode_context = (struct cmt_msgpack_decode_context *) context;
+
+    if (decode_context->metric == NULL ||
+        decode_context->metric->exp_hist_positive_buckets == NULL ||
+        index >= decode_context->metric->exp_hist_positive_count) {
+        return CMT_DECODE_MSGPACK_CORRUPT_INPUT_DATA_ERROR;
+    }
+
     return cmt_mpack_consume_uint_tag(reader, &decode_context->metric->exp_hist_positive_buckets[index]);
 }

 static int unpack_exp_histogram_negative_bucket(mpack_reader_t *reader, size_t index, void *context)
 {
     struct cmt_msgpack_decode_context *decode_context;

     decode_context = (struct cmt_msgpack_decode_context *) context;
+
+    if (decode_context->metric == NULL ||
+        decode_context->metric->exp_hist_negative_buckets == NULL ||
+        index >= decode_context->metric->exp_hist_negative_count) {
+        return CMT_DECODE_MSGPACK_CORRUPT_INPUT_DATA_ERROR;
+    }
+
     return cmt_mpack_consume_uint_tag(reader, &decode_context->metric->exp_hist_negative_buckets[index]);
 }
```
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/cmetrics/src/cmt_encode_prometheus_remote_write.c (1)
988-999: ⚠️ Potential issue | 🔴 Critical — Pre-existing null-pointer dereference if `create_label` returns NULL.
If `create_label(NULL)` fails, `result` is set to the error code but `cfl_list_add(&dummy_label->_head, ...)` on line 998 still executes with a NULL `dummy_label`. This is a pre-existing issue (same pattern exists for the HISTOGRAM and SUMMARY paths), not introduced by this PR, but the new EXP_HISTOGRAM code path inherits it.
🐛 Suggested fix
```diff
             dummy_label = create_label(NULL);

             if (dummy_label == NULL) {
                 result = CMT_ENCODE_PROMETHEUS_REMOTE_WRITE_ALLOCATION_ERROR;
+                break;
             }

             cfl_list_add(&dummy_label->_head, &metric->labels);
```
🤖 Fix all issues with AI agents
In `@tests/internal/opentelemetry.c`:
- Around line 1359-1361: The variable ret is reused after calling
flb_opentelemetry_metrics_json_to_cmt, which causes
destroy_metrics_context_list(&context_list) to be skipped if later validation
calls (test_extract_label_values, cmt_gauge_get_val, cmt_counter_get_val, etc.)
modify ret; fix this by capturing the decode outcome immediately after
flb_opentelemetry_metrics_json_to_cmt (e.g., int decode_ret = ret or int
decode_status = flb_ret) and then continue using ret for subsequent validation
checks, but use decode_ret when deciding whether to call
destroy_metrics_context_list(&context_list) (i.e., call
destroy_metrics_context_list only if decode_ret ==
CMT_DECODE_OPENTELEMETRY_SUCCESS).
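A minimal, self-contained sketch of the suggested pattern — the decode outcome is remembered in its own variable so later validation checks can reuse `ret` without affecting the cleanup decision; all helpers here are stubs, not the real test functions:

```c
#include <stdio.h>

/* Stubs standing in for flb_opentelemetry_metrics_json_to_cmt(),
 * the validation helpers and destroy_metrics_context_list(). */
static int decode_payload(void)      { return 0; }   /* 0 == success */
static int validate_samples(void)    { return 0; }
static void destroy_context_list(void) { puts("context list destroyed"); }

static int run_case(void)
{
    int decode_ret;        /* decode outcome, kept separate from 'ret' */
    int ret;

    decode_ret = decode_payload();
    ret = decode_ret;

    if (decode_ret == 0) {
        /* 'ret' may be overwritten by any number of validation checks */
        ret = validate_samples();
    }

    /* cleanup keys off the remembered decode outcome, not 'ret' */
    if (decode_ret == 0) {
        destroy_context_list();
    }

    return ret;
}

int main(void)
{
    return run_case();
}
```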
🧹 Nitpick comments (1)
tests/internal/opentelemetry.c (1)
775-825: `run_metrics_case` is a ~590-line monolithic function.
This function handles validation for all metric types (gauge, counter, histogram, summary, exp_histogram) in one flat block. While it works, the repetitive pattern for each metric type (find key → check name → check value → check labels) could be extracted into helpers, improving readability and making it easier to add new metric types.
Not blocking, just noting for future maintainability.
@codex review
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/cmetrics/src/cmt_encode_opentelemetry.c (1)
38-57: ⚠️ Potential issue | 🔴 Critical — Missing variant type check: accessing `as_kvlist` on a non-kvlist variant is undefined behavior.
`cfl_kvlist_fetch` may return a variant of any type. This function accesses `entry_variant->data.as_kvlist` without verifying `entry_variant->type == CFL_VARIANT_KVLIST`. Compare with the equivalent helper in `cmt_encode_text.c` (lines 51–65 in the relevant snippets), which guards with `entry_variant->type != CFL_VARIANT_KVLIST`. If the key maps to a non-kvlist variant (e.g., string, int), this returns a garbage pointer that callers will dereference.
Proposed fix — add the type guard
```diff
 struct cfl_kvlist *fetch_metadata_kvlist_key(struct cfl_kvlist *kvlist, char *key)
 {
     struct cfl_variant *entry_variant;
-    struct cfl_kvlist *entry_kvlist;

     if (kvlist == NULL) {
         return NULL;
     }

     entry_variant = cfl_kvlist_fetch(kvlist, key);
-
-    if (entry_variant != NULL) {
-        entry_kvlist = entry_variant->data.as_kvlist;
-    }
-    else {
-        entry_kvlist = NULL;
+    if (entry_variant == NULL || entry_variant->type != CFL_VARIANT_KVLIST) {
+        return NULL;
     }

-    return entry_kvlist;
+    return entry_variant->data.as_kvlist;
 }
```
🧹 Nitpick comments (4)
lib/cmetrics/src/cmt_metric.c (1)
44-62: Trailing stores in `add()` race with concurrent callers.
After the CAS loop on `metric->val` succeeds (line 54), lines 58–61 execute without mutual exclusion. Two concurrent `add()` threads can interleave these stores, leaving `val_int64`/`val_uint64` out of sync with the CAS'd `val`. Because `add()` always sets `value_type = DOUBLE`, `cmt_metric_get_value_snapshot` won't detect the inconsistency (type stays the same across the tear).
If `val_int64`/`val_uint64` are only informational for the DOUBLE path this is tolerable, but worth a brief comment in the source so future maintainers don't rely on those fields for precise values after `add()`.
lib/cmetrics/src/cmt_encode_opentelemetry.c (2)
445-471: `compute_flat_scope_index` returns 0 for both "valid first scope" and "error" — callers cannot distinguish.
When `context` is NULL, `resource_index` is out of bounds, or `scope_index` is out of bounds, this returns 0 — the same value as a legitimate first-scope index. `resolve_target_scope_index` (line 498) validates the result against `scope_metrics_count`, but `compute_flat_scope_index` itself masks errors from its callers. Consider returning a sentinel (e.g., `SIZE_MAX`) for errors, or an out-parameter pattern, so that callers can detect invalid input.
3019-3027: Redundant `memset` after `calloc`.
`calloc` already zero-initializes. The `memset` on line 3027 is a no-op.
Remove redundant memset
```diff
     context = calloc(1, sizeof(struct cmt_opentelemetry_context));

     if (context == NULL) {
         result = CMT_ENCODE_OPENTELEMETRY_ALLOCATION_ERROR;
         goto cleanup;
     }

-    memset(context, 0, sizeof(struct cmt_opentelemetry_context));
-
     context->cmt = cmt;
```
lib/cmetrics/tests/exp_histogram.c (1)
773-789: Unconditional `cmt_encode_prometheus_remote_write_destroy` when `encoded_remote_write` may be NULL.
If `cmt_encode_prometheus_remote_write_create` returns NULL (line 773), execution falls through to line 788 which calls destroy on NULL. This likely works because the underlying `cfl_sds_destroy` tolerates NULL, but for consistency with the guarded pattern used in `test_exp_histogram_encoder_smoke` (lines 498–500), consider guarding it.
Guard the destroy call
```diff
-    cmt_encode_prometheus_remote_write_destroy(encoded_remote_write);
+    if (encoded_remote_write != NULL) {
+        cmt_encode_prometheus_remote_write_destroy(encoded_remote_write);
+    }
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 80699d632a
Fix two JSON metrics decode issues in flb_opentelemetry_metrics.c.
- Parse `asInt` datapoints as exact int64 instead of double conversion.
- Set gauge/counter samples with typed cmetrics setters to avoid >2^53
  rounding loss (see the sketch after this list).
- Keep counter reset behavior consistent by clamping negative int values
when reset is not allowed.
- Preserve resource/scope metadata from JSON payloads by cloning:
- resource metadata/attributes
- resource_metrics schema_url
- scope metadata/attributes
- scope_metrics schema_url
into `context->external_metadata` for OTLP re-encoding
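The first two bullets are about precision: a double has a 53-bit mantissa, so integer sample values above 2^53 silently round when routed through a double. A minimal sketch of parsing a JSON `asInt` value (carried as a string in the protobuf JSON mapping) into an exact int64 — this is illustrative, not the code in flb_opentelemetry_metrics.c, and the final print shows the rounding the typed setters avoid:

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse an OTLP JSON "asInt" value (carried as a string) into an exact
 * int64_t instead of going through a double. Returns 0 on success. */
static int parse_as_int(const char *text, int64_t *out)
{
    char *end = NULL;
    long long value;

    if (text == NULL || *text == '\0') {
        return -1;
    }

    errno = 0;
    value = strtoll(text, &end, 10);
    if (errno != 0 || *end != '\0') {
        return -1;
    }

    *out = (int64_t) value;
    return 0;
}

int main(void)
{
    int64_t exact = 0;
    const char *as_int = "9007199254740993";   /* 2^53 + 1 */

    if (parse_as_int(as_int, &exact) == 0) {
        /* going through a double would round this to 9007199254740992 */
        printf("exact int64: %lld, via double: %.0f\n",
               (long long) exact, (double) exact);
    }

    return 0;
}
```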
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
CI is stuck on Windows in an OAuth2 test (not related to this work).
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.