Utilize tags for metrics sent to SafeDogStatsdLogger #8743

Closed
williamBartos opened this issue May 6, 2020 · 8 comments
williamBartos commented May 6, 2020

Description

A recent PR enabled DogStatsD support for Airflow metrics: #7376. While this enables the use of DogStatsD, the code sending metrics through SafeDogStatsdLogger doesn't use tagging; instead it sends a unique, monolithic metric name that can't be aggregated across identifiers such as <dag_id>. This doesn't scale when someone wants to monitor metrics across multiple DAGs, because each metric sent by each DAG is unique: the number of monitors grows with the number of DAGs.

An example is the timer metrics sent by a DagRun, such as dagrun.duration.failed.<dag_id>. When sent by the DagRun object, <dag_id> isn't a tag but part of the metric name itself: https://github.com/apache/airflow/blob/master/airflow/models/dagrun.py#L412-L420

What is the problem here?

By sending metrics to DataDog without tags, it becomes impossible to aggregate metrics across <dag_id>, because each dagrun.duration.failed.<dag_id> sent by a DAG is unique to that <dag_id>.

If I have 20 DAGs in production and want to monitor dagrun.duration.failed.<dag_id>, that means I'll need 20 separate monitors!


But if <dag_id> is sent as a tag, a single monitor could be used and DataDog can group the metric by <dag_id>.
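To illustrate the difference at the DogStatsD datagram level (a sketch only: the metric prefix, the made-up my_dag DAG, and the duration value are placeholders):

current approach, <dag_id> baked into the metric name:

    dagrun.duration.failed.my_dag:52000|ms

proposed approach, <dag_id> carried as a tag:

    dagrun.duration.failed:52000|ms|#dag_id:my_dag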

Use case / motivation

The current way metrics are sent to DataDog isn't scalable, as it prevents a user from aggregating common metrics across unique tags.

Following the DagRun example above, the information needed to send this metric with a tag is already available. Given this line of code: https://github.com/apache/airflow/blob/master/airflow/models/dagrun.py#L418 and the accompanying function definition: https://github.com/apache/airflow/blob/master/airflow/stats.py#L172, we can modify the function call to send <dag_id> as a tag:

toy example:

duration = (self.end_date - self.start_date)
if self.state is State.SUCCESS:
    if isinstance(Stats, SafeDogStatsdLogger):
        Stats.timing('dagrun.duration.success', duration, tags=[self.dag_id])
    else:
        Stats.timing('dagrun.duration.success.{}'.format(self.dag_id), duration)

The preference here is probably not to do type checking before submitting the metric. I'm willing to discuss other solutions here or as part of a PR, and to implement the agreed-upon solution.
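For discussion, here is a minimal sketch of one alternative that avoids the isinstance check: every stats backend exposes the same timing signature with an optional tags argument, and backends that don't understand tags simply ignore them. The method signatures, constructor wiring, and tag format below are assumptions for illustration, not the code as it exists in airflow/stats.py.

class SafeStatsdLogger:
    """Plain StatsD backend: accepts `tags` for interface parity but ignores them."""

    def __init__(self, statsd_client):
        self.statsd = statsd_client

    def timing(self, stat, dt, tags=None):
        # Plain StatsD has no tag concept; a fuller version could fold tag
        # values back into the metric name to preserve today's behaviour.
        self.statsd.timing(stat, dt)


class SafeDogStatsdLogger:
    """DogStatsD backend: forwards tags natively."""

    def __init__(self, dogstatsd_client):
        self.dogstatsd = dogstatsd_client

    def timing(self, stat, dt, tags=None):
        # DogStatsD understands tags natively, e.g. ['dag_id:my_dag'].
        self.dogstatsd.timing(metric=stat, value=dt, tags=tags or [])

Call sites would then stay backend-agnostic, e.g. Stats.timing('dagrun.duration.success', duration, tags=['dag_id:{}'.format(self.dag_id)]).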

Related Issues

This is the PR that introduced the SafeDogStatsdLogger class: #7376

williamBartos added the kind:feature (Feature Requests) label on May 6, 2020
boring-cyborg bot commented May 6, 2020

Thanks for opening your first issue here! Be sure to follow the issue template!

arbass22 commented Feb 19, 2021

I was also thinking about this. As a simple use case (which sounds similar to yours), it would be great if airflow.ti_successes and airflow.ti_failures were tagged with the DAG and task, and maybe even the worker hostname/pool. A sketch of what that could look like is below.
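For illustration only (a sketch using the datadog Python client directly, with made-up dag_id/task_id/hostname/pool values; Airflow's actual emission goes through its Stats facade), tagged counters could look roughly like this:

from datadog import DogStatsd

# Hypothetical, hand-rolled emission of tagged task-instance counters.
statsd = DogStatsd(host='127.0.0.1', port=8125, namespace='airflow')

statsd.increment(
    'ti_successes',
    tags=['dag_id:example_dag', 'task_id:extract', 'hostname:worker-1'],
)
statsd.increment(
    'ti_failures',
    tags=['dag_id:example_dag', 'task_id:load', 'pool:default_pool'],
)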

mik-laj (Member) commented Feb 26, 2021

Related: #12158

williamBartos (Author) commented:

Great! I missed this notification, but I'll definitely check it out when I get the chance.

arbass22 commented:

@williamBartos I stumbled upon this today, which could be a good workaround: https://docs.datadoghq.com/developers/dogstatsd/dogstatsd_mapper/. It also seems this use case is common enough that Airflow metrics are the actual example DataDog uses. A sketch of a mapper profile is below.
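For context, a mapper profile in the Agent's datadog.yaml looks roughly like this (a sketch based on the DogStatsD Mapper docs linked above; the exact match patterns, metric names, and tag names are assumptions and would need to be adapted to the metrics you care about):

dogstatsd_mapper_profiles:
  - name: airflow
    prefix: "airflow."
    mappings:
      - match: "airflow.dagrun.duration.failed.*"
        name: "airflow.dagrun.duration.failed"
        tags:
          dag_id: "$1"
      - match: "airflow.dagrun.duration.success.*"
        name: "airflow.dagrun.duration.success"
        tags:
          dag_id: "$1"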

potiuk (Member) commented May 18, 2022

airflow metrics are the actual example that DataDog uses

Nice.

Also, I think the right solution here is to add OpenTelemetry support rather than extend Airflow's DataDog/StatsD integration. We have a nice AIP (Airflow Improvement Proposal): https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-49+OpenTelemetry+Support+for+Apache+Airflow and a discussion https://lists.apache.org/thread/1sdz7mx8j495h5vddngrwpxbx03p04z6 where any comments are welcome. The proposal covers a much more comprehensive approach to the OpenTelemetry integration, and I think we should focus on making that happen rather than changing our DataDog or StatsD integration.

DataDog already supports OTel integration, and once we implement OTel in Airflow, you will be able to plug in not only the metrics but also all other telemetry data, including detailed metrics of Airflow internals, traces that let you follow task and DAG execution across multiple components, and (in the future) logging. And not only with DataDog but with all the other telemetry services.
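For a flavour of how this maps onto the tagging problem above (a sketch using the OpenTelemetry Python API with made-up meter, metric, and attribute names, not Airflow's actual instrumentation; a configured MeterProvider/exporter is assumed), metrics carry attributes that backends such as DataDog surface as tags:

from opentelemetry import metrics

# Illustrative names only; assumes a MeterProvider and exporter are configured.
meter = metrics.get_meter("airflow.demo")
dagrun_duration = meter.create_histogram("dagrun.duration", unit="s")

# Attributes play the role that tags play in DogStatsD.
dagrun_duration.record(52.0, {"dag_id": "example_dag", "state": "failed"})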

potiuk (Member) commented May 18, 2022

cc: @howardyoo

hussein-awala (Member) commented:

Closed by #28961
