
Emit DataDog statsd metrics with metadata tags #28961

Merged (10 commits) on Jan 20, 2023

Conversation

hussein-awala
Member



In this PR, I'm adding a new config option, metrics.statsd_datadog_metrics_tags. When it's True, Airflow emits some of the metrics (the counters) with tags that add details about the metric source.
This can help users create custom dashboards, aggregate and filter the metrics, and detect problems more easily. Note that activating this feature can increase the DataDog cost.
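
For illustration, here is a minimal sketch (not Airflow's internal code) of what the new option changes, using the datadog Python package's DogStatsd client; the metric and tag names below are made up:

```python
# Minimal sketch (not Airflow's actual implementation) using the datadog
# package's DogStatsd client. Metric and tag names are illustrative only.
from datadog import DogStatsd

statsd = DogStatsd(host="localhost", port=8125)

# Without the option: the source details are baked into the metric name itself.
statsd.increment("airflow.dagrun.success.example_dag")

# With metrics.statsd_datadog_metrics_tags = True: the source details travel
# as DataDog tags instead, so dashboards can aggregate and filter on them.
statsd.increment("airflow.dagrun.success", tags=["dag_id:example_dag"])
```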


boring-cyborg bot added the area:Scheduler label on Jan 15, 2023
hussein-awala requested review from uranusjr and removed review requests for ashb, ephraimbuddy, kaxil, XD-DENG and jedcunningham on January 18, 2023 20:10
@sungwy
Contributor

sungwy commented Feb 15, 2023

@potiuk @hussein-awala @uranusjr

I've started working with our backend folks to add the new metric tags to the backend so we can read the soon-to-be-published metrics, and I was reminded that metric cardinality is an issue when it comes to storage space and the retention period of the tags. I'm not sure about other infrastructures, but for us, the cardinality of a metric is measured as:

number of unique metric names * number of unique application tag pairs

Introducing tags to existing metrics that already have these values concatenated into their names doesn't actually increase the cardinality by a lot (it only doubles, from duplication of metrics on the same events). But as a rule of thumb, I think we would benefit from carefully analyzing the potential for cardinality explosion from each new tag.

As an example, my only concern with this PR is the new tag attribute 'run_id', which is unique for every single dag_run and hence increases the cardinality by the number of unique scheduled dag_runs during a retention period.

This means that for an Airflow instance with 1000 daily jobs and a metric retention period of 10 days, we increase the cardinality by 10,000 on a single metric just by adding this tag alone. If we add this tag to a few other metrics, that could easily result in an explosion of metric cardinality, and of storage requirements for a metrics backend. As a benchmark, our allocated quota for metric cardinality is 100,000 per tenancy, and I'm wondering whether other users may face similar storage-based concerns as well.
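
For reference, a quick back-of-the-envelope version of the arithmetic described above, using the hypothetical numbers from this comment rather than any real measurements:

```python
# Back-of-the-envelope version of the arithmetic above; the numbers are the
# hypothetical ones from this comment, not measurements.
daily_dag_runs = 1000     # unique run_id values created per day
retention_days = 10       # metric retention period

# Each unique run_id tag value creates its own time series for the metric,
# so a single tagged counter contributes roughly:
series_from_run_id_tag = daily_dag_runs * retention_days
print(series_from_run_id_tag)  # 10000 -- for one metric alone
```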

Could I get your thoughts on this? Is there room to discuss and potentially backtrack the addition of run_id as a metric tag in the upcoming release?

@sungwy
Contributor

sungwy commented Feb 16, 2023

On that note, I'm wondering if we should review this metric as well: local_task_job.task_exit.<job_id>.<dag_id>.<task_id>.<return_code>
I'm seeing this metric take up most of the storage capacity on our metrics backend for the same cardinality reason, with no tags on it at all! It's taking up 67,000 slots out of a total of 70,000 in our test cluster.
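
A rough sketch of why that concatenated metric name grows so fast; the numbers below are assumptions for illustration only, not measurements from any deployment:

```python
# Illustrative only (made-up numbers, not measurements): because job_id is
# unique per task execution, every retained task run can mint a brand-new
# name for local_task_job.task_exit.<job_id>.<dag_id>.<task_id>.<return_code>.
task_runs_per_day = 5000
retention_days = 10

# Roughly one new time series per retained task run, before any tags at all.
estimated_series = task_runs_per_day * retention_days
print(estimated_series)  # 50000 distinct metric names in the backend
```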

If you think this warrants its own Issue to facilitate more discussion before opening a PR, I'm happy to open one as well.

@hussein-awala
Member Author

I'll test it and check with the folks at DataDog whether the newly added tag might cause a problem, then we can decide whether to remove it completely or add a parameter to enable/disable it.

@sungwy
Contributor

sungwy commented Feb 16, 2023

Thank you @hussein-awala - appreciate it!

@sungwy
Contributor

sungwy commented Feb 16, 2023

At a quick glance, the concern about high-cardinality metrics seems pretty universal:

Splunk: https://www.splunk.com/en_us/blog/devops/high-cardinality-monitoring-is-a-must-have-for-microservices-and-containers.html

DataDog: https://arapulido.github.io/blog/2021/11/15/understanding-dd-tag-cardinality-in-kubernetes/

And it looks like metric cardinality would directly affect the pricing plan for custom metrics as well.

@potiuk
Member

potiuk commented Feb 18, 2023

Yes. I think if we see high-cardinality metrics we could indeed add features to disable them - though I'm not sure whether it should be a single "disable-high-cardinality-metrics" flag or a list of metrics to disable. Both have advantages and disadvantages. The single flag is more opinionated about what counts as high cardinality, but it also has the potential to be used in the OTEL implementation (cc: @feruzzi).
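
For illustration only, the two shapes being weighed could look roughly like the hypothetical sketch below; the HIGH_CARDINALITY_TAGS set and filter_tags helper do not exist in Airflow and are placeholders for the design discussion:

```python
# Hypothetical sketch of the two option shapes discussed above; none of these
# names exist in Airflow -- they are placeholders for the design discussion.
from __future__ import annotations

HIGH_CARDINALITY_TAGS = {"run_id"}  # a built-in, opinionated list (option 1)

def filter_tags(tags: list[str],
                disable_high_cardinality: bool = False,
                blocked_tags: set[str] | None = None) -> list[str]:
    """Drop unwanted tags before emitting a metric."""
    blocked = set(blocked_tags or ())
    if disable_high_cardinality:
        blocked |= HIGH_CARDINALITY_TAGS
    return [t for t in tags if t.split(":", 1)[0] not in blocked]

tags = ["dag_id:example", "run_id:manual__2023-02-18"]
# Option 1: a single opinionated flag.
print(filter_tags(tags, disable_high_cardinality=True))  # ['dag_id:example']
# Option 2: an explicit list of tags (or metrics) to disable.
print(filter_tags(tags, blocked_tags={"run_id"}))        # ['dag_id:example']
```

Either way, the filtering would need to happen before the StatsD client call so that high-cardinality values never reach the metrics backend.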

It looks like the discussion on what cardinality is/should be, and on making its documentation and explanation part of the OTEL specification, is open in open-telemetry/opentelemetry-specification#2996, so once we get into the OTEL implementation we could also think about it and take part in that discussion.

@sungwy
Contributor

sungwy commented Feb 21, 2023

Thank you for that reference @potiuk - I think we can lean on the fact that OTEL is also trying to better document the problems high-cardinality metrics pose to users, which justifies implementing a solution of our own for StatsD metrics in the interim. I think this discussion has grown enough to warrant an issue of its own so we can agree on a solution. Will open one up.

Labels: area:Scheduler (including HA (high availability) scheduler), type:new-feature (Changelog: New Features)