[DSIP-8][Metrics] Improve DolphinScheduler Monitoring #9324

Open
9 of 14 tasks
Tracked by #14102
EricGao888 opened this issue Apr 2, 2022 · 24 comments
Labels
DSIP feature new feature help wanted Extra attention is needed metrics

Comments

@EricGao888
Member

EricGao888 commented Apr 2, 2022

Search before asking

  • I have searched the issues and found no similar feature request.

Description

  • Monitoring plays an essential role in software stability. However, DolphinScheduler currently offers only statistics and no metrics, which means users cannot export metrics to an external observability system to monitor their workflows, their tasks, or DolphinScheduler's own performance.
  • To match our slogan, "Choose good tools, Back home early. Use Right Scheduler, Sleep Tight.", we need richer metrics to improve monitoring and give our users a better experience with DolphinScheduler, especially in production environments.
  • Here are the Email Thread and Proposal.

Use case

  • To achieve the improvement described in the Description section, we could take three steps:
  1. List all the metrics we need, classified by DolphinScheduler component, such as master, worker, API server, etc. Here's the doc link for the metrics list.
  2. Instrument the code in the right places and collect these metrics with our metrics-collection framework.
  3. Find a way to expose these metrics to external systems. related: [Improvement][Common] Use JMX to expose configuration and metrics #5255
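
The three steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not DolphinScheduler's or Micrometer's actual API: the `MetricRegistry` class, the metric name `ds_task_finish_total`, and the `status` label are all hypothetical. It declares a metric, increments it from the code path being observed, and renders it in the Prometheus text exposition format so that an external system could scrape it.

```python
from collections import defaultdict


class MetricRegistry:
    """Hypothetical in-process counter registry (not a DolphinScheduler API)."""

    def __init__(self):
        # (metric name, sorted label items) -> running total
        self._counters = defaultdict(float)

    def increment(self, name, amount=1.0, **labels):
        """Step 2: called from the code path we want to observe."""
        key = (name, tuple(sorted(labels.items())))
        self._counters[key] += amount

    def render_prometheus(self):
        """Step 3: render all counters in the Prometheus text format."""
        lines = []
        typed = set()
        for (name, labels), value in sorted(self._counters.items()):
            if name not in typed:
                lines.append(f"# TYPE {name} counter")
                typed.add(name)
            label_text = ",".join(f'{k}="{v}"' for k, v in labels)
            suffix = f"{{{label_text}}}" if label_text else ""
            lines.append(f"{name}{suffix} {value:g}")
        return "\n".join(lines) + "\n"


# Step 1: the metric list here is a single, made-up counter.
registry = MetricRegistry()
registry.increment("ds_task_finish_total", status="success")
registry.increment("ds_task_finish_total", status="fail")
print(registry.render_prometheus())
```

Using a label for the status, rather than one metric name per status, keeps the metric list short and lets the monitoring backend aggregate across statuses.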

Action Items

Stage I

Stage II

  • Make external monitoring system configurable and extensible.
  • Add popular exporters supported by Micrometer besides Prometheus, such as CloudWatch, Datadog, StatsD, Influx, JMX, Elastic, etc. For a full list, see the Micrometer Setup section. In addition, to give users a smooth experience, we should add Docker YAML files for each exporter for demo purposes.
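
One way to read "configurable and extensible" is a small plugin registry keyed by a config value. The sketch below is an assumption about the shape of such a design, not the project's implementation: the `EXPORTERS` table, the `register_exporter` decorator, the `metrics.exporter` config key, and the metric name are hypothetical, and the two toy exporters only format strings instead of talking to a real backend.

```python
EXPORTERS = {}  # hypothetical registry: exporter name -> factory


def register_exporter(name):
    """Decorator that makes an exporter selectable by name from config."""
    def decorator(factory):
        EXPORTERS[name] = factory
        return factory
    return decorator


@register_exporter("prometheus")
class PrometheusTextExporter:
    def export(self, name, value):
        return f"{name} {value}"  # Prometheus text exposition sample line


@register_exporter("statsd")
class StatsdExporter:
    def export(self, name, value):
        return f"{name}:{value}|g"  # StatsD gauge datagram


def build_exporter(config):
    """Pick the exporter named in config; fail loudly on unknown names."""
    name = config.get("metrics.exporter", "prometheus")
    try:
        return EXPORTERS[name]()
    except KeyError:
        known = ", ".join(sorted(EXPORTERS))
        raise ValueError(f"unknown exporter {name!r}; known: {known}") from None


exporter = build_exporter({"metrics.exporter": "statsd"})
print(exporter.export("ds_worker_threads", 8))
```

Adding a new backend then means registering one more class, without touching the code that records metrics.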

Stage III

Related issues

related: #5255

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@EricGao888 EricGao888 added feature new feature Waiting for reply Waiting for reply labels Apr 2, 2022
@github-actions

github-actions bot commented Apr 2, 2022

Hi:

@SbloodyS
Member

SbloodyS commented Apr 2, 2022

I think it would be better to include the number of threads related to worker and master execution in the monitoring.

@EricGao888
Member Author

I just updated the Google doc in the Use case section to take some more metrics into consideration.

Another thing I propose we think about is the granularity of metrics. The current metrics are general statistics, and the statistics for tasks and workflows are separate. We may need a metric like task.duration.<workflow_id>.<task_id> to monitor vital workflows/tasks more accurately. Of course, a side effect is that the number of metrics will explode, which could cause performance issues. To avoid this, two methods could work:

  1. Provide a config option that lets users switch the generation of these metrics on or off.
  2. Have DolphinScheduler send those metrics over UDP.
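
The two mitigations above could look roughly like this. It is a sketch under assumptions, not DolphinScheduler code: the `fine_grained_enabled` switch, the `FineGrainedMetrics` class, and the host and port are hypothetical, and the StatsD-style `name:value|ms` datagram is just one common way to ship timings over UDP. Fire-and-forget UDP keeps the sender from blocking on the monitoring backend, at the cost of possibly dropping samples.

```python
import socket


def format_timing(workflow_id, task_id, millis):
    """Build a StatsD-style timing line for one workflow/task pair."""
    return f"task.duration.{workflow_id}.{task_id}:{millis}|ms"


class FineGrainedMetrics:
    """Hypothetical emitter: per-task metrics behind an on/off switch."""

    def __init__(self, fine_grained_enabled, host="127.0.0.1", port=8125, send=None):
        self.enabled = fine_grained_enabled
        self._address = (host, port)
        self._sock = None
        # `send` is injectable so the class can be exercised without a network.
        self._send = send or self._send_udp

    def _send_udp(self, payload):
        if self._sock is None:
            self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        # UDP is fire-and-forget: no connection, no acknowledgement.
        self._sock.sendto(payload, self._address)

    def record_task_duration(self, workflow_id, task_id, millis):
        if not self.enabled:
            return False  # switch is off: generate nothing
        self._send(format_timing(workflow_id, task_id, millis).encode("ascii"))
        return True


# Capture datagrams in a list instead of sending them, for demonstration.
sent = []
metrics = FineGrainedMetrics(fine_grained_enabled=True, send=sent.append)
metrics.record_task_duration(42, 7, 1530)
print(sent)
```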

@EricGao888
Member Author

Besides, we need some descriptions of the existing metrics in the official docs. #9441

@ruanwenjun
Member

@EricGao888 Hi, I closed #5255, since there is already a module, dolphinscheduler-meter, that can expose the metrics. I will take part in this work to provide some common methods.

@EricGao888 EricGao888 changed the title [Improvement][Common] Improve monitoring of Dolphinscheduler [Improvement][Common] Improve DolphinScheduler Monitoring Jun 13, 2022
@EricGao888 EricGao888 changed the title [Improvement][Common] Improve DolphinScheduler Monitoring [Improvement][Metrics] Improve DolphinScheduler Monitoring Jun 21, 2022
@EricGao888 EricGao888 changed the title [Improvement][Metrics] Improve DolphinScheduler Monitoring [Feature][Metrics] Improve DolphinScheduler Monitoring Jun 21, 2022
@SbloodyS
Member

I think this issue deserves the DSIP label. WDYT? @zhongjiajie

@EricGao888
Member Author

EricGao888 commented Jun 21, 2022

@devosend Hello, may I ask whether it is possible to include the three stage I PRs in beta-2? That way, we could get feedback from users in advance and resolve more potential issues before 3.0.0-stable. WDYT?

@zhongjiajie
Member

I think this issue deserves the DSIP label. WDYT? @zhongjiajie

Agreed, we should add the DSIP label for this.

@SbloodyS SbloodyS added the DSIP label Jun 22, 2022
@zhongjiajie
Member

@EricGao888 Could you follow the https://dolphinscheduler.apache.org/en-us/community/DSIP.html guide to turn it into a DSIP?

@SbloodyS SbloodyS changed the title [Feature][Metrics] Improve DolphinScheduler Monitoring [DSIP-8][Metrics] Improve DolphinScheduler Monitoring Jun 22, 2022
@SbloodyS SbloodyS added the help wanted Extra attention is needed label Jun 22, 2022
@zhongjiajie
Member

@EricGao888 Could you follow the https://dolphinscheduler.apache.org/en-us/community/DSIP.html guide to turn it into a DSIP?

Oh, I remember you already discussed monitoring by e-mail in https://lists.apache.org/thread/6sogjh6k7f2hv954mhn24c94l2mzwgsz. Maybe you should append a few words and tell users we now want to convert it to a DSIP.

EricGao888 added a commit to EricGao888/dolphinscheduler that referenced this issue Jul 11, 2022
EricGao888 added a commit that referenced this issue Jul 12, 2022
…10749)

* [Feature][Metrics] Add resource download related metrics for workers (#9324)

* [Feature][Metrics] Fix bugs and add grafana demos for worker resource download metrics (#9324)

* [Feature][Metrics] Add docs to resource related metrics (#9324)

* [Feature][Metrics] Use tags to indicate status in metrics (#9324)

* [Feature][Metrics] Fix demos, docs and remove redundant code (#9324)

* [Feature][Metrics] Remove .pnpm-debug.log (#9324)

* [Feature][Metrics] Fix style check (#9324)

* [Feature][Metrics] Replace KB with bytes for the unit of resource file size in metrics (#9324)

* [Feature][Metrics] Make code neat (#9324)
zhongjiajie pushed a commit to zhongjiajie/dolphinscheduler that referenced this issue Jul 12, 2022
…pache#10749)

* [Feature][Metrics] Add resource download related metrics for workers (apache#9324)

* [Feature][Metrics] Fix bugs and add grafana demos for worker resource download metrics (apache#9324)

* [Feature][Metrics] Add docs to resource related metrics (apache#9324)

* [Feature][Metrics] Use tags to indicate status in metrics (apache#9324)

* [Feature][Metrics] Fix demos, docs and remove redundant code (apache#9324)

* [Feature][Metrics] Remove .pnpm-debug.log (apache#9324)

* [Feature][Metrics] Fix style check (apache#9324)

* [Feature][Metrics] Replace KB with bytes for the unit of resource file size in metrics (apache#9324)

* [Feature][Metrics] Make code neat (apache#9324)
@EricGao888
Member Author

EricGao888 commented Jul 25, 2022

Looks like some PRs related to metrics have not been cherry-picked to 3.0.0-prepare. How about picking them once #10867 is merged? @ruanwenjun @caishunfeng @zhongjiajie Thx~

@caishunfeng
Contributor

Looks like some PRs related to metrics have not been cherry-picked to 3.0.0-prepare. How about picking them once #10867 is merged? @ruanwenjun @caishunfeng @zhongjiajie Thx~

I think it's better to put them into the next version. We are about to release 3.0.0-release, and during this time we only want to cherry-pick bugfix PRs.

@EricGao888
Member Author

EricGao888 commented Jul 25, 2022

Looks like some PRs related to metrics have not been cherry-picked to 3.0.0-prepare. How about picking them once #10867 is merged? @ruanwenjun @caishunfeng @zhongjiajie Thx~

I think it's better to put them into the next version. We are about to release 3.0.0-release, and during this time we only want to cherry-pick bugfix PRs.

Sure, makes sense to me. Thx~

Projects
Status: In Progress

7 participants