
Emit DataDog statsd metrics with metadata tags #28961

Merged (10 commits) on Jan 20, 2023

Conversation

hussein-awala
Member



In this PR, I'm adding a new config option, metrics.statsd_datadog_metrics_tags. When it's True, Airflow emits some of the metrics (the counters) with tags that add details about the metric source.
This can help users create custom dashboards, aggregate and filter the metrics, and detect problems more easily. Note that activating this feature can increase the DataDog cost.
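
For illustration, here is a minimal sketch (not Airflow's internal code) of what the new option changes, using the datadog Python package's DogStatsd client; the metric and tag names below are made up:

```python
# Minimal sketch (not Airflow's actual implementation) using the datadog
# package's DogStatsd client. Metric and tag names are illustrative only.
from datadog import DogStatsd

statsd = DogStatsd(host="localhost", port=8125)

# Without the option: the source details are baked into the metric name itself.
statsd.increment("airflow.dagrun.success.example_dag")

# With metrics.statsd_datadog_metrics_tags = True: the source details travel
# as DataDog tags instead, so dashboards can aggregate and filter on them.
statsd.increment("airflow.dagrun.success", tags=["dag_id:example_dag"])
```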


boring-cyborg bot added the area:Scheduler label on Jan 15, 2023
hussein-awala requested review from uranusjr and removed review requests for ashb, ephraimbuddy, kaxil, XD-DENG and jedcunningham on January 18, 2023 20:10
@sungwy
Contributor

sungwy commented Feb 15, 2023

@potiuk @hussein-awala @uranusjr

I've started working with our backend folks to add the new metric tags to the backend so we can read the soon-to-be-published metrics, and I was reminded that metric cardinality is an issue when it comes to storage space and the retention period of the tags. I'm not sure about other infrastructures, but for us, the cardinality of a metric is measured as:

number of unique metric names * number of unique application tag pairs

Introducing tags to existing metrics that already have these values concatenated into their names doesn't actually increase the cardinality by a lot (it only doubles, from duplication of metrics on the same events). But as a rule of thumb, I think we would benefit from carefully analyzing the potential for cardinality explosion from each new tag.

As an example, my only concern with this PR is the new tag attribute 'run_id', which is unique for every single dag_run and hence increases the cardinality by the number of unique scheduled dag_runs during a retention period.

This means that for an Airflow instance with 1000 daily jobs and a metric retention period of 10 days, we increase the cardinality by 10,000 on a single metric just by adding this tag alone. If we add this tag to a few other metrics, that could easily result in an explosion of metric cardinality, and of storage requirements for a metrics backend. As a benchmark, our allocated quota for metric cardinality is 100,000 per tenancy, and I'm wondering whether other users may face similar storage-based concerns as well.
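
For reference, a quick back-of-the-envelope version of the arithmetic described above, using the hypothetical numbers from this comment rather than any real measurements:

```python
# Back-of-the-envelope version of the arithmetic above; the numbers are the
# hypothetical ones from this comment, not measurements.
daily_dag_runs = 1000     # unique run_id values created per day
retention_days = 10       # metric retention period

# Each unique run_id tag value creates its own time series for the metric,
# so a single tagged counter contributes roughly:
series_from_run_id_tag = daily_dag_runs * retention_days
print(series_from_run_id_tag)  # 10000 -- for one metric alone
```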

Could I get your thoughts on this? Is there room to discuss and potentially backtrack the addition of run_id as a metric tag in the upcoming release?

@sungwy
Contributor

sungwy commented Feb 16, 2023

On that note, I'm wondering if we should review this metric as well: local_task_job.task_exit.<job_id>.<dag_id>.<task_id>.<return_code>
I'm seeing this metric take up most of the storage capacity on our metrics backend for the same cardinality reason, with no tags on it at all! It's taking up 67,000 slots out of a total of 70,000 in our test cluster.
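
A rough sketch of why that concatenated metric name grows so fast; the numbers below are assumptions for illustration only, not measurements from any deployment:

```python
# Illustrative only (made-up numbers, not measurements): because job_id is
# unique per task execution, every retained task run can mint a brand-new
# name for local_task_job.task_exit.<job_id>.<dag_id>.<task_id>.<return_code>.
task_runs_per_day = 5000
retention_days = 10

# Roughly one new time series per retained task run, before any tags at all.
estimated_series = task_runs_per_day * retention_days
print(estimated_series)  # 50000 distinct metric names in the backend
```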

If you think this warrants its own Issue to facilitate more discussion before opening a PR, I'm happy to open one as well.

@hussein-awala
Member Author

I'll test it and check with the folks at DataDog whether the newly added tag might cause a problem, then we can decide whether to remove it completely or add a parameter to enable/disable it.

@sungwy
Contributor

sungwy commented Feb 16, 2023

Thank you @hussein-awala - appreciate it!

@sungwy
Contributor

sungwy commented Feb 16, 2023

At a quick glance, the concern about high-cardinality metrics seems pretty universal:

Splunk: https://www.splunk.com/en_us/blog/devops/high-cardinality-monitoring-is-a-must-have-for-microservices-and-containers.html

DataDog: https://arapulido.github.io/blog/2021/11/15/understanding-dd-tag-cardinality-in-kubernetes/

And it looks like metric cardinality would directly affect the pricing plan for custom metrics as well.

@potiuk
Member

potiuk commented Feb 18, 2023

Yes. I think if we see high-cardinality metrics we could indeed add features to disable them - though I'm not sure whether it should be a single "disable-high-cardinality-metrics" flag or a list of metrics to disable. Both have advantages and disadvantages. The single flag is more opinionated about what counts as high cardinality, but it also has the potential to be used in the OTEL implementation (cc: @feruzzi).
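
For illustration only, the two shapes being weighed could look roughly like the hypothetical sketch below; the HIGH_CARDINALITY_TAGS set and filter_tags helper do not exist in Airflow and are placeholders for the design discussion:

```python
# Hypothetical sketch of the two option shapes discussed above; none of these
# names exist in Airflow -- they are placeholders for the design discussion.
from __future__ import annotations

HIGH_CARDINALITY_TAGS = {"run_id"}  # a built-in, opinionated list (option 1)

def filter_tags(tags: list[str],
                disable_high_cardinality: bool = False,
                blocked_tags: set[str] | None = None) -> list[str]:
    """Drop unwanted tags before emitting a metric."""
    blocked = set(blocked_tags or ())
    if disable_high_cardinality:
        blocked |= HIGH_CARDINALITY_TAGS
    return [t for t in tags if t.split(":", 1)[0] not in blocked]

tags = ["dag_id:example", "run_id:manual__2023-02-18"]
# Option 1: a single opinionated flag.
print(filter_tags(tags, disable_high_cardinality=True))  # ['dag_id:example']
# Option 2: an explicit list of tags (or metrics) to disable.
print(filter_tags(tags, blocked_tags={"run_id"}))        # ['dag_id:example']
```

Either way, the filtering would need to happen before the StatsD client call so that high-cardinality values never reach the metrics backend.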

It looks like the discussion on what cardinality is/should be, and on making its documentation and explanation part of the OTEL specification, is open in open-telemetry/opentelemetry-specification#2996, so once we get into the OTEL implementation we could also think about it and take part in that discussion.

@sungwy
Contributor

sungwy commented Feb 21, 2023

Thank you for that reference @potiuk - I think we can lean on the fact that OTEL is also trying to better document the problems high-cardinality metrics pose to users, which justifies implementing a solution of our own for StatsD metrics in the interim. I think this discussion has grown enough to warrant an issue of its own so we can agree on a solution. Will open one up.

Labels: area:Scheduler (including HA (high availability) scheduler), type:new-feature (Changelog: New Features)