add counter metrics for remote execution #11155

tdyas · 2020-11-12T08:12:16Z

Add counters to the remote execution code to provide greater visibility into its operation.

[ci skip-build-wheels]

coveralls · 2020-11-12T08:59:10Z

Coverage remained the same at 0.0% when pulling dd46836 on tdyas:add_remote_execution_counters into 91e82e2 on pantsbuild:master.

stuhood · 2020-11-12T20:36:50Z

src/rust/engine/process_execution/src/remote.rs

+            context
+              .workunit_store
+              .increment_counter(Metric::RemoteExecutionErrors, 1);


RpcFailure won't increment this one... is that a lower level error, or should this increment move up between Err(err) => and match err {..} ?

The counter will be incremented in the code that follows that match. Note that the RpcFailure match arm just deconstructs the error and turns it into a OperationOrStatus::Status instance.

stuhood · 2020-11-12T20:41:08Z

src/rust/engine/workunit_store/src/metrics.rs

@@ -35,7 +35,12 @@ pub enum Metric {
  LocalCacheReadErrors,
  LocalCacheWriteErrors,
  LocalExecutionRequests,
+  RemoteExecutionErrors,


This one is incremented for every RPC error, rather than necessarily for every corresponding RemoteExecutionRequest, which means it's sortof at a different level of abstraction. It might be good to namespace them differently and have these both be paired: "RPC attempt, success/fail" vs "execution attempt, success/fail"?

One thing that helps to encourage this in a lot of metrics APIs is to have a context-manager/wrapper that records both the attempt, and the success/failure as a pair. That's a little bit more challenging to use here since some of the "errors" are actually recoverable, and thus not counted that way, but just a thought.

Good idea. I already increment a counter per RPC (RemoteExecutionMethodExecute and RemoteExecutionMethodWaitExecution depending on which method it is). I can rename those s/Method/RPC/ and have the error be "RPCError."

Then I can move the overall execution request counters (with its own error counter) up into remote::CommandRunner.

[ci skip-build-wheels]

Internal only changes: * Rewrite changelog helper to Python and use markdown for changelog ([#11224](#11224)) * Update to nails `0.8.0`. ([#11226](#11226)) * upgrade to Rust v1.48.0 ([#11210](#11210)) * Move Session to its own file. ([#11222](#11222)) * Port InteractiveProcess to Tokio to unblock async interruption ([#11219](#11219)) * Add `./cargo` script to facilitate Rust development ([#11218](#11218)) * Add partition metadata ([#11212](#11212)) * Hotfix unbound `$PY` variable breaking cargo script ([#11209](#11209)) * Remove dataclasses backport ([#11208](#11208)) * Stop using Python 3.6 internally and in CI ([#11175](#11175)) * Update to tokio 0.2.23. ([#11200](#11200)) * Add test address metadata ([#11193](#11193)) * Port more tests from `TestBase` to `RuleRunner` ([#11183](#11183)) * acquire GIL during work unit conversion ([#11186](#11186)) * Add counter metrics for remote execution ([#11155](#11155)) * add metrics to local cache code paths ([#11146](#11146)) * Deprecate the `sources` field for `python_awslambda` ([#11176](#11176)) * Make S3 bucket expiry deletion handling robust. ([#11156](#11156)) * Detect delete markers in our s3 cache of native_engine.so. ([#11140](#11140)) * Fix target globs for deleted file `pants_run_integration_test.py`. ([#11127](#11127)) * Deduplicate `Platform` and `PlatformConstraint` ([#11157](#11157)) * move counter increment into check_action_cache ([#11154](#11154)) * Replace the "base target" concept with "BUILD targets" ([#11136](#11136)) * Re-land the port of Pants' nailgun client to Rust ([#11147](#11147)) * Flatten execution and target PlatformConstraints in MultiPlatformProcess ([#11145](#11145)) * Prepare 2.1.0rc2 ([#11180](#11180)) * Prepare 2.1.0rc1 ([#11144](#11144)) * Prepare 2.0.1rc0 ([#11141](#11141)) * Prepare 2.0.1rc2 ([#11220](#11220)) * Prepare 2.1.1rc0 ([#11221](#11221)) * Prepare 2.1.0 ([#11198](#11198)) * Prepare 2.0.1rc1 ([#11192](#11192)) * Prepare 2.1.0rc3 ([#11191](#11191)) [ci skip-rust]

tdyas requested review from stuhood, illicitonion and Eric-Arellano November 12, 2020 08:12

stuhood approved these changes Nov 12, 2020

View reviewed changes

tdyas force-pushed the add_remote_execution_counters branch from 78a61eb to f2b2e11 Compare November 13, 2020 23:00

Tom Dyas added 5 commits November 13, 2020 22:14

remote execution metrics

189637c

[ci skip-build-wheels]

rename RPC-related metrics to include RPC in the name

051dd98

[ci skip-build-wheels]

add request-related counters

85b76d1

[ci skip-build-wheels]

rename rpc metric

23b318a

[ci skip-build-wheels]

alphabetical order

dd46836

[ci skip-build-wheels]

tdyas force-pushed the add_remote_execution_counters branch from f2b2e11 to dd46836 Compare November 14, 2020 06:14

tdyas merged commit f084d4f into pantsbuild:master Nov 14, 2020

tdyas deleted the add_remote_execution_counters branch November 14, 2020 07:43

Eric-Arellano mentioned this pull request Nov 24, 2020

Prepare 2.2.0.dev0 #11239

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add counter metrics for remote execution #11155

add counter metrics for remote execution #11155

tdyas commented Nov 12, 2020

coveralls commented Nov 12, 2020 •

edited

Loading

stuhood Nov 12, 2020

tdyas Nov 13, 2020

stuhood Nov 12, 2020

tdyas Nov 13, 2020

add counter metrics for remote execution #11155

add counter metrics for remote execution #11155

Conversation

tdyas commented Nov 12, 2020

coveralls commented Nov 12, 2020 • edited Loading

stuhood Nov 12, 2020

Choose a reason for hiding this comment

tdyas Nov 13, 2020

Choose a reason for hiding this comment

stuhood Nov 12, 2020

Choose a reason for hiding this comment

tdyas Nov 13, 2020

Choose a reason for hiding this comment

coveralls commented Nov 12, 2020 •

edited

Loading