Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add attempt metrics #21286

Merged
merged 7 commits into from
Jan 18, 2023
Merged

add attempt metrics #21286

merged 7 commits into from
Jan 18, 2023

Conversation

cgardens
Copy link
Contributor

@cgardens cgardens commented Jan 12, 2023

What

I want to be able to build this dashboard (below). I want to know what the failure rate is with connector failures removed. Ideally this is what we would use to trigger pages for the platform team, instead of triggering based on failure rate with connectors included.
Screen Shot 2023-01-11 at 5 26 29 PM (link)

I can't build it because we don't emit one event per attempt or job. We emit n events based on the number of release stages and failure origins. Because of this, I can't really get numbers that reflect the reality. They get skewed by the number of stages and failure origins. I end up with a result I don't really trust.

In addition, I find dashboards like this one that are wrong, because they double count jobs since for each job there could be 1 or 2 release stages.

How

  • I am proposing emitting new events for attempts. There will be 1 created and 1 completed event for each attempt.
  • I get around the multiple release stage issue by having a min and max release stage field. Any graph we could build with the old one we could build using this new scheme and not have confusing cardinality issues.
  • For failure origin and failure type, just emit the first one (if present).
  • Suggests normalizing the way we refer to data planes. Using just geography isn't going to cut it. We need something. Not too picky on what.
  • Added a constructor in OssMetricRepository so that we can declare what dimensions we expect to ship with a metric. Not sure this is a good idea yet. The declaration is good, but it currently can't be enforced.
  • I did it for attempts and not jobs (at least for now) because that's really what I need for the current dashboard, but we could do the same for jobs if we like it.
  • Skipped doing pending since it's not really meaningful. I would love to figure out how to get a meaningful pending metric though.

To d

  • Still need to wire up the metrics to actually get emitted. Will do this if I get a 👍 on the approach.

OssMetricsRegistry(final MetricEmittingApp application,
final String metricName,
final String metricDescription) {
final String metricDescription,
final String... metricTags) {
Copy link
Contributor

@davinchia davinchia Jan 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i like this idea even though we can't strictly enforce it. Perhaps we can do slightly better by turning this into a list of metric tags and having a static constant named NO_METRIC_TAGS to make it obvious.

Any bit of discoverability or readability helps.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davinchia That would add additional logic to check the list to see if it only has one entry and it's value is NO_METRIC_TAGS. That seems more complicated than using the ellipses to indicate that the list/array can simply not be provided if there are no tags. This also aligns with how the OpenTracing/DataDog tracing classes work.

Copy link
Contributor

@davinchia davinchia Jan 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. I wasn't even thinking of going that far.

I was suggesting that we add a sentinel variable and enforce via convention that the metrics with no tags are initialised with new OssMericsRegistry(application, name, description, NO_TAG) vs new OssMericsRegistry(application, name, description). I like this because in the empty case, I no longer need to click into the constructor, read javadocs and understand what the variable arguments mean.

@davinchia
Copy link
Contributor

Sounds reasonable. I like the min/max approach. I think we should give it a shot. It's a 2-way door so should be easy to try and pick what we like best.

Curious, why do you say pending isn't meaningful? If I recall correctly, that alert has helped us in the past when there are issues with the worker deployment.

Copy link
Contributor

@jdpgrailsdev jdpgrailsdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@cgardens
Copy link
Contributor Author

@davinchia

Curious, why do you say pending isn't meaningful? If I recall correctly, that alert has helped us in the past when there are issues with the worker deployment.

      workflowInternalState.setJobId(getOrCreateJobId(connectionUpdaterInput));
      workflowInternalState.setAttemptNumber(createAttempt(workflowInternalState.getJobId()));

A job is set to pending on line 208 of ConnectionManagerWorkflowImpl. It is then set to started on line 209. So all pending tells us is that it line 208 has executed but not 209, which doesn't seem all that useful. 😛 It would be great to have a way of actually tracking pending, aka we have a job we want to schedule but it hasn't started yet.

@cgardens cgardens requested a review from a team as a code owner January 14, 2023 01:37
@cgardens cgardens requested a review from davinchia January 14, 2023 01:38
@octavia-squidington-iv octavia-squidington-iv added area/platform issues related to the platform area/worker Related to worker labels Jan 14, 2023
@cgardens
Copy link
Contributor Author

cgardens commented Jan 14, 2023

Update: I removed the running attempts metric. Because we use a gauge, it doesn't work well to add several dimensions. For now I think nailing the creation and completion is really the key anyway. The rest is all wired up now.

I removed data plane id for now. I think that should added as part of the data plane portability project, so I don't want to preempt that.

@jdpgrailsdev mind taking one more look?

@cgardens cgardens temporarily deployed to more-secrets January 14, 2023 01:39 — with GitHub Actions Inactive
@cgardens cgardens temporarily deployed to more-secrets January 14, 2023 01:39 — with GitHub Actions Inactive
@cgardens cgardens temporarily deployed to more-secrets January 14, 2023 01:40 — with GitHub Actions Inactive
@cgardens cgardens temporarily deployed to more-secrets January 14, 2023 01:40 — with GitHub Actions Inactive
@cgardens cgardens temporarily deployed to more-secrets January 14, 2023 02:01 — with GitHub Actions Inactive
@cgardens cgardens temporarily deployed to more-secrets January 14, 2023 02:01 — with GitHub Actions Inactive
@cgardens cgardens temporarily deployed to more-secrets January 14, 2023 02:03 — with GitHub Actions Inactive
@cgardens cgardens temporarily deployed to more-secrets January 14, 2023 02:03 — with GitHub Actions Inactive
@cgardens cgardens temporarily deployed to more-secrets January 14, 2023 02:03 — with GitHub Actions Inactive
@cgardens cgardens temporarily deployed to more-secrets January 14, 2023 02:03 — with GitHub Actions Inactive
@cgardens cgardens temporarily deployed to more-secrets January 15, 2023 02:43 — with GitHub Actions Inactive
@cgardens cgardens temporarily deployed to more-secrets January 15, 2023 02:43 — with GitHub Actions Inactive
@cgardens cgardens temporarily deployed to more-secrets January 15, 2023 02:43 — with GitHub Actions Inactive
@cgardens cgardens temporarily deployed to more-secrets January 15, 2023 02:43 — with GitHub Actions Inactive
@github-actions
Copy link
Contributor

File Coverage [91.7%] 🍏
OssMetricsRegistry.java 99.29% 🍏
DefaultJobPersistence.java 95.92% 🍏
MoreLists.java 86.36% 🍏
Attempt.java 85.25% 🍏
Job.java 84.22% 🍏
JobCreationAndStatusUpdateActivityImpl.java 83.95% 🍏
MetricTags.java 0%
Total Project Coverage 26.69% 🍏

@cgardens cgardens force-pushed the cgardens/attempt-tracking branch from 00f2cb1 to 459a53f Compare January 15, 2023 23:55
@cgardens cgardens temporarily deployed to more-secrets January 15, 2023 23:57 — with GitHub Actions Inactive
@cgardens cgardens temporarily deployed to more-secrets January 15, 2023 23:57 — with GitHub Actions Inactive
@github-actions
Copy link
Contributor

github-actions bot commented Jan 16, 2023

Airbyte Code Coverage

File Coverage [91.64%] 🍏
OssMetricsRegistry.java 99.29% 🍏
DefaultJobPersistence.java 95.92% 🍏
MoreLists.java 86.36% 🍏
Attempt.java 85.25% 🍏
Job.java 84.22% 🍏
JobCreationAndStatusUpdateActivityImpl.java 83.71% 🍏
MetricTags.java 0%
Total Project Coverage 26.71% 🍏

@cgardens cgardens temporarily deployed to more-secrets January 17, 2023 20:07 — with GitHub Actions Inactive
@cgardens cgardens temporarily deployed to more-secrets January 17, 2023 20:07 — with GitHub Actions Inactive
@cgardens cgardens merged commit 9415eb5 into master Jan 18, 2023
@cgardens cgardens deleted the cgardens/attempt-tracking branch January 18, 2023 00:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/platform issues related to the platform area/server area/worker Related to worker
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants