Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add counter metrics for remote execution #11155

Merged
merged 5 commits into from
Nov 14, 2020

Conversation

tdyas
Copy link
Contributor

@tdyas tdyas commented Nov 12, 2020

Add counters to the remote execution code to provide greater visibility into its operation.

[ci skip-build-wheels]

@coveralls
Copy link

coveralls commented Nov 12, 2020

Coverage Status

Coverage remained the same at 0.0% when pulling dd46836 on tdyas:add_remote_execution_counters into 91e82e2 on pantsbuild:master.

Comment on lines 642 to 644
context
.workunit_store
.increment_counter(Metric::RemoteExecutionErrors, 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RpcFailure won't increment this one... is that a lower level error, or should this increment move up between Err(err) => and match err {..} ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The counter will be incremented in the code that follows that match. Note that the RpcFailure match arm just deconstructs the error and turns it into a OperationOrStatus::Status instance.

@@ -35,7 +35,12 @@ pub enum Metric {
LocalCacheReadErrors,
LocalCacheWriteErrors,
LocalExecutionRequests,
RemoteExecutionErrors,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is incremented for every RPC error, rather than necessarily for every corresponding RemoteExecutionRequest, which means it's sortof at a different level of abstraction. It might be good to namespace them differently and have these both be paired: "RPC attempt, success/fail" vs "execution attempt, success/fail"?

One thing that helps to encourage this in a lot of metrics APIs is to have a context-manager/wrapper that records both the attempt, and the success/failure as a pair. That's a little bit more challenging to use here since some of the "errors" are actually recoverable, and thus not counted that way, but just a thought.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I already increment a counter per RPC (RemoteExecutionMethodExecute and RemoteExecutionMethodWaitExecution depending on which method it is). I can rename those s/Method/RPC/ and have the error be "RPCError."

Then I can move the overall execution request counters (with its own error counter) up into remote::CommandRunner.

@tdyas tdyas force-pushed the add_remote_execution_counters branch from 78a61eb to f2b2e11 Compare November 13, 2020 23:00
Tom Dyas added 5 commits November 13, 2020 22:14
[ci skip-build-wheels]
[ci skip-build-wheels]
[ci skip-build-wheels]
[ci skip-build-wheels]
@tdyas tdyas force-pushed the add_remote_execution_counters branch from f2b2e11 to dd46836 Compare November 14, 2020 06:14
@tdyas tdyas merged commit f084d4f into pantsbuild:master Nov 14, 2020
@tdyas tdyas deleted the add_remote_execution_counters branch November 14, 2020 07:43
Eric-Arellano added a commit that referenced this pull request Nov 25, 2020
Internal only changes:

* Rewrite changelog helper to Python and use markdown for changelog ([#11224](#11224))

* Update to nails `0.8.0`. ([#11226](#11226))

* upgrade to Rust v1.48.0 ([#11210](#11210))

* Move Session to its own file. ([#11222](#11222))

* Port InteractiveProcess to Tokio to unblock async interruption ([#11219](#11219))

* Add `./cargo` script to facilitate Rust development ([#11218](#11218))

* Add partition metadata ([#11212](#11212))

* Hotfix unbound `$PY` variable breaking cargo script ([#11209](#11209))

* Remove dataclasses backport ([#11208](#11208))

* Stop using Python 3.6 internally and in CI ([#11175](#11175))

* Update to tokio 0.2.23. ([#11200](#11200))

* Add test address metadata ([#11193](#11193))

* Port more tests from `TestBase` to `RuleRunner` ([#11183](#11183))

* acquire GIL during work unit conversion ([#11186](#11186))

* Add counter metrics for remote execution  ([#11155](#11155))

* add metrics to local cache code paths ([#11146](#11146))

* Deprecate the `sources` field for `python_awslambda` ([#11176](#11176))

* Make S3 bucket expiry deletion handling robust. ([#11156](#11156))

* Detect delete markers in our s3 cache of native_engine.so. ([#11140](#11140))

* Fix target globs for deleted file `pants_run_integration_test.py`. ([#11127](#11127))

* Deduplicate `Platform` and `PlatformConstraint` ([#11157](#11157))

* move counter increment into check_action_cache ([#11154](#11154))

* Replace the "base target" concept with "BUILD targets" ([#11136](#11136))

* Re-land the port of Pants' nailgun client to Rust ([#11147](#11147))

* Flatten execution and target PlatformConstraints in MultiPlatformProcess ([#11145](#11145))

* Prepare 2.1.0rc2 ([#11180](#11180))

* Prepare 2.1.0rc1 ([#11144](#11144))

* Prepare 2.0.1rc0 ([#11141](#11141))

* Prepare 2.0.1rc2 ([#11220](#11220))

* Prepare 2.1.1rc0 ([#11221](#11221))

* Prepare 2.1.0 ([#11198](#11198))

* Prepare 2.0.1rc1 ([#11192](#11192))

* Prepare 2.1.0rc3 ([#11191](#11191))

[ci skip-rust]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants