[BE-308] Federated metrics support #2900
Conversation
(force-pushed from 13bcd5c to 6096b14)
With a couple comments, the second commit looks good to me!
this.stopped = true;
return {
  durationNs: durationHrTimeToNanos(process.hrtime(this.startHrTime)),
We've recently started taking great care not to use Node.js-specific APIs, and process.hrtime certainly falls into that category, which (I suspect among other things, like its dependence on protobuf) makes apollo-engine-reporting one of those packages that has to be conditionally require'd depending on whether the environment supports it.

I don't think we should do anything right now; I'm just pointing out that it might be worth looking into the performance-now package, which falls back (CoffeeScript!) to performance.now (browsers) and Date.now().
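For context, here is a minimal sketch of the kind of fallback chain performance-now implements; the helper below is purely illustrative and not part of this PR:

// Illustrative only: prefer process.hrtime, then performance.now, then Date.now.
type NowMs = () => number; // milliseconds since an arbitrary origin

function chooseTimer(): NowMs {
  const g = globalThis as any;
  if (g.process && typeof g.process.hrtime === 'function') {
    // Node.js: convert the [seconds, nanoseconds] tuple to fractional milliseconds.
    return () => {
      const [s, ns] = g.process.hrtime();
      return s * 1e3 + ns / 1e6;
    };
  }
  if (g.performance && typeof g.performance.now === 'function') {
    // Browsers (and newer Node): monotonic, but only a double with roughly microsecond precision.
    return () => g.performance.now();
  }
  // Last resort: wall-clock milliseconds, subject to clock adjustments.
  return () => Date.now();
}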
This is a good point and I'll add it as a comment. It's not a regression: we've always used process.hrtime for every JS implementation of timing, and I think it would be bad for people to use Date.now() for timing since it lacks appropriate precision (and clock skew control).
I am underwhelmed by both performance.now() and performance-now for this purpose.

I think we definitely don't need to care about supporting engine reporting from browsers. But we may need to care about alternate non-Node server implementations. I can't recall anyone ever complaining that engine didn't work on their server due to lack of process.hrtime, though?

performance.now uses a double instead of a pair of integers, so it has less precision, and even the docs basically say to only expect microsecond precision. (And browser implementations explicitly fudge the numbers to work around Spectre issues.)

performance-now has https://github.com/myrne/performance-now/blob/614099d729aef5d7bdbb85c0d1060166d63d05c7/src/performance-now.coffee#L6-L8 which, well: the entire point of process.hrtime returning a silly pair of integers instead of a single number is that you can't precisely represent nanoseconds in a JS number! See my comment on durationHrTimeToNanos: after a process uptime of 104 days, this function will start losing precision.
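To spell out that figure: a JS number only represents integers exactly up to Number.MAX_SAFE_INTEGER (2^53 - 1, about 9.0e15), and 9.0e15 nanoseconds is roughly 104 days. A sketch of the conversion being discussed (the PR's durationHrTimeToNanos has this general shape, though the exact code may differ):

// Sketch of converting an hrtime tuple to nanoseconds; the PR's helper is
// similar in spirit but may differ in detail.
function durationHrTimeToNanos(hrtime: [number, number]): number {
  // hrtime[0] = whole seconds, hrtime[1] = nanosecond remainder (< 1e9).
  return hrtime[0] * 1e9 + hrtime[1];
}

// Why ~104 days: Number.MAX_SAFE_INTEGER = 2**53 - 1 ≈ 9.007e15 ns,
// and 9.007e15 / 1e9 / 86400 ≈ 104.2 days. Beyond that, distinct nanosecond
// counts start collapsing onto the same double.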
In other words, I'd be way more interested in a function with the API of process.hrtime that tries to produce integer nanoseconds from performance.now than the other way around.
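A sketch of that shape, purely illustrative: a function with process.hrtime's [seconds, nanoseconds] API backed by performance.now (assuming a global performance, i.e. browsers or recent Node):

// Illustrative only: an hrtime-shaped wrapper over performance.now.
function hrtimeFromPerformanceNow(previous?: [number, number]): [number, number] {
  const ms = performance.now(); // a double, so only ~microsecond precision at best
  let seconds = Math.floor(ms / 1e3);
  let nanos = Math.round((ms % 1e3) * 1e6);
  if (nanos >= 1e9) {
    seconds += 1;
    nanos -= 1e9;
  }
  if (previous) {
    // Mirror process.hrtime(previous): return the delta since the earlier reading.
    seconds -= previous[0];
    nanos -= previous[1];
    if (nanos < 0) {
      seconds -= 1;
      nanos += 1e9;
    }
  }
  return [seconds, nanos];
}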
// of GraphQLError are not enumerable and won't show up in the trace
// (even in the json field) anyway.)
// XXX I really don't understand why we don't allow rewriteError
// to rewrite extensions, which do show up in JSON.
Not rewriting extensions seems like an oversight!
Fixed in #2932
(force-pushed from 6096b14 to df4c52a)
(force-pushed from 40bd29e to 2276497)
(force-pushed from ec3a614 to aad8e2f)
Surprise! This is now the federated metrics PR.
uint64 sent_time_offset = 4;

// Wallclock times measured in the gateway for when this operation was
// sent and recieved.
oops typo
}
message FetchNode {
  string serviceName = 1;
  string operation = 2;
Let's get rid of this until we build UI that needs it. In general we probably shouldn't collect any string data that we don't immediately present to users, to help them form a better model of what data we collect.
Add a comment noting that we might want to augment it with more information about the operation later, when we build a feature that needs it.
// FIXME: do we need requires?

// This Trace only contains start_time, end_time, duration_ns, and root;
// all timings were calculated on the federated service.
Be explicit somewhere, as an overall statement, that all fields on FetchNode are measured on the gateway.
@@ -51,11 +52,14 @@ export interface GraphQLResponse {
}

export interface GraphQLRequestMetrics {
could use comments
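For instance, something along these lines (the field names below are hypothetical, purely to illustrate the kind of doc comments being asked for; the interface in this PR may have different members):

// Hypothetical illustration only; not the actual interface from this PR.
export interface GraphQLRequestMetricsExample {
  // True when a trace for this request should be captured and sent to Engine.
  captureTraces?: boolean;
  // True when the operation was served from the persisted query cache.
  persistedQueryHit?: boolean;
  // True when the client registered a persisted query on this request.
  persistedQueryRegister?: boolean;
  // Filled in by the gateway with its protobuf query plan trace so the
  // reporting agent can attach it to the outgoing report.
  queryPlanTrace?: unknown;
}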
public didEncounterErrors(errors: GraphQLError[]) {
  errors.forEach(err => {
    // This is an error from a federated service. We will already be reporting
I suppose if we supported nested gateway levels (which won't work now for reporting anyway) then maybe this is wrong. Maybe it should be moved to EngineReportingAgent instead of the tree builder, before the call to the tree builder.
(force-pushed from a9d4ece to 281ad6e)
// in its response.
if (traceNode) {
  http = {
    headers: new Headers({ 'apollo-federation-include-trace': 'ftv1' }),
It seems like we're only sending these to the underlying service, but I was pretty sure there was a way to hook into the request context from the gateway to persist incoming headers and pass them along 🤔
@abernix or @jbaxleyiii, any idea if I'm missing something here? Maybe the header injection into the service happens elsewhere.
I think the hooking in happens at the GraphQLTransport level, so we might have some more work to do here. But I think this means that if users do that, they'll lose metrics tracking rather than losing their transport changes, which we should definitely resolve, but it's terrifying.
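To make the concern concrete, here is a rough sketch (not the actual gateway code) of merging the tracing header into whatever headers a user's transport hook has already set, rather than replacing them; it assumes the fetch-style Headers used by apollo-server-env:

// Illustrative only: merge the federated-tracing header into existing headers
// instead of overwriting them, so user-supplied headers survive.
function withFederatedTracingHeader(existing?: Headers): Headers {
  const headers = new Headers(existing);
  headers.set('apollo-federation-include-trace', 'ftv1');
  return headers;
}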
(force-pushed from 2aa7b23 to 14e96cd)
Augments apollo-engine-reporting and @apollo/gateway to provide federated metrics to Engine from the gateway.

Extend trace reports to contain a QueryPlan tree when traces are received from a gateway, mimicking the query plan that the gateway makes internally. FetchNodes on the tree correspond to GraphQL operations executed against federated services; each FetchNode contains its own tree of Trace.Nodes.

- apollo-engine-reporting
  - Factor out core tree building logic to EngineReportingTreeBuilder class
  - New EngineFederatedTracingExtension adds a base64-encoded protobuf of a Trace (with just the node tree and top level timestamps) to an extension on federated service responses (if requested to do so via an HTTP header from the gateway; sketched below).
  - Implement new originalFieldName trace field (similar goal to #2401, BE-158).
  - If @apollo/gateway built us a query plan trace, add it to the trace
- @apollo/gateway
  - Builds up a protobuf representation of the query plan, including sub-traces from federated services on FetchNodes
- apollo-engine-reporting-protobuf
  - Copy reports.proto from internal repo
  - Set editorconfig settings to match internal repo
  - Always use normal numbers, not long.js objects
- apollo-server-core
  - Thread some data through between reporting and gateway modules.
- apollo-server-integration-testsuite
  - Clean up some ts warnings

TODO record __resolveReference as multiple nodes?
TODO is it actually OK to include the operation text in fetch trace nodes?
TODO more tests, especially of error cases
TODO internal docs (eg FIXME comment in reports.proto)
TODO external docs
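A rough sketch of the ftv1 round trip described above, using the Trace message exported by apollo-engine-reporting-protobuf (the helper names are illustrative, not the PR's actual functions):

import { Trace } from 'apollo-engine-reporting-protobuf';

// Federated service side (sketch): serialize the request's Trace and expose
// its base64 encoding under the "ftv1" key of the response extensions.
function encodeFtv1(trace: Trace): string {
  return Buffer.from(Trace.encode(trace).finish()).toString('base64');
}

// Gateway side (sketch): decode the extension from a fetch response back into
// a Trace so it can be grafted onto the corresponding FetchNode of the query plan.
function decodeFtv1(ftv1: string): Trace {
  return Trace.decode(Buffer.from(ftv1, 'base64'));
}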
Previously, the query plan execution logic would throw an error and fail the entire operation if it could not parse a trace. Now it records the fact that it failed to parse the trace on the overall fetch node itself. This commit also adds some comments.
Build still breaking :(
We used to rely on the error code matching "DOWNSTREAM_SERVICE_ERROR" to determine that an error came from an underlying service rather than from the gateway itself. In the past few weeks, error handling in the gateway has been modified to allow this error code to be overridden, but we are guaranteed to always have a serviceName if the error comes from a downstream service. Note, however, that this means that if users pass "serviceName" in an error extension, it will NOT be reported for metrics, even if it occurred at the gateway level. We may want to find something less flimsy than this.
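A sketch of the check this describes (the exact place serviceName lives on the error in this PR may differ):

import { GraphQLError } from 'graphql';

// Sketch: treat an error as coming from a downstream (federated) service,
// and therefore already counted in that service's own ftv1 trace, when the
// gateway has attached a serviceName to it.
function errorIsFromFederatedService(error: GraphQLError): boolean {
  const extensions = error.extensions as Record<string, any> | undefined;
  return typeof (extensions && extensions.serviceName) === 'string';
}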
(force-pushed from ff3863a to 303d690)
- apollo-cache-control@0.7.6-alpha.10
- apollo-datasource-rest@0.5.2-alpha.1
- apollo-engine-reporting-protobuf@0.3.2-alpha.0
- apollo-engine-reporting@1.4.0-alpha.10
- @apollo/federation@0.6.11-alpha.10
- @apollo/gateway@0.7.0-alpha.10
- apollo-server-azure-functions@2.7.0-alpha.10
- apollo-server-cloud-functions@2.7.0-alpha.10
- apollo-server-cloudflare@2.7.0-alpha.10
- apollo-server-core@2.7.0-alpha.10
- apollo-server-express@2.7.0-alpha.10
- apollo-server-fastify@2.7.0-alpha.10
- apollo-server-hapi@2.7.0-alpha.10
- apollo-server-integration-testsuite@2.7.0-alpha.10
- apollo-server-koa@2.7.0-alpha.10
- apollo-server-lambda@2.7.0-alpha.10
- apollo-server-micro@2.7.0-alpha.10
- apollo-server-plugin-base@0.6.0-alpha.10
- apollo-server-plugin-response-cache@0.2.7-alpha.10
- apollo-server-testing@2.7.0-alpha.10
- apollo-server-types@0.1.1-alpha.1
- apollo-server@2.7.0-alpha.10
- apollo-tracing@0.7.5-alpha.10
- graphql-extensions@0.8.0-alpha.10
- apollo-server-azure-functions@2.7.0-alpha.11
- apollo-server-cloud-functions@2.7.0-alpha.11
- apollo-server-cloudflare@2.7.0-alpha.11
- apollo-server-express@2.7.0-alpha.11
- apollo-server-fastify@2.7.0-alpha.11
- apollo-server-hapi@2.7.0-alpha.11
- apollo-server-integration-testsuite@2.7.0-alpha.11
- apollo-server-koa@2.7.0-alpha.11
- apollo-server-lambda@2.7.0-alpha.11
- apollo-server-micro@2.7.0-alpha.11
- apollo-server-testing@2.7.0-alpha.11
- apollo-server@2.7.0-alpha.11
- apollo-server-azure-functions@2.7.0-alpha.12
- apollo-server-cloud-functions@2.7.0-alpha.12
- apollo-server-cloudflare@2.7.0-alpha.12
- apollo-server-express@2.7.0-alpha.12
- apollo-server-fastify@2.7.0-alpha.12
- apollo-server-hapi@2.7.0-alpha.12
- apollo-server-integration-testsuite@2.7.0-alpha.12
- apollo-server-koa@2.7.0-alpha.12
- apollo-server-lambda@2.7.0-alpha.12
- apollo-server-micro@2.7.0-alpha.12
- apollo-server-testing@2.7.0-alpha.12
- apollo-server@2.7.0-alpha.12
A previous commit in #2900 migrated the extension for reporting tracing information to Engine into an abstraction called the treeBuilder. The treeBuilder is now used by both a federated extension and the normal extension, but the two have different end-timing semantics: in the normal extension, the end timing is captured at the end of the *request*, whereas in the federated extension it is captured at the end of *execution*. The reason is that the standard extension does not actually append information to the response's extensions field, whereas the federated extension does, so serialization and transmission of the response cannot be included in the federated extension's timing.

Due to the limitations of the current requestPipeline, executionDidEnd is the hook that must be used to halt the overall timing, and it is called directly before didEncounterErrors. It is likely possible to refactor the request pipeline to capture the timing of serializing errors into the proto state, but since that would add more boolean control flow to the overall request context, this commit simply discards that timing information for reporting instead of throwing an error. A subsequent commit will follow up by replacing error throws with logging statements.
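A sketch of that difference in hook terms (hook names loosely follow the graphql-extensions API of the time; the PR's actual wiring may differ):

// Illustrative only: where each extension stops the shared tree builder's
// overall timer.

interface TreeBuilderLike {
  stopTiming(): void;
}

class EngineReportingExtensionSketch {
  constructor(private treeBuilder: TreeBuilderLike) {}
  // The full reporting extension never writes into the response's extensions
  // field, so it can wait until the response is about to go out.
  public willSendResponse() {
    this.treeBuilder.stopTiming();
  }
}

class FederatedTracingExtensionSketch {
  constructor(private treeBuilder: TreeBuilderLike) {}
  // The federated extension must serialize its trace *into* the response's
  // extensions field, so the latest it can stop is the end of execution.
  public executionDidStart(): () => void {
    return () => this.treeBuilder.stopTiming(); // the executionDidEnd callback
  }
}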
* [BE-308] Federated metrics support
* Update protobuf to match cloud repo
* No longer set nonexistent operation field
* Update proto to fix typo
* Convert trace failure into metric
* add snapshot-serializer test
* Add to comment justifying console.warn
* update snapshot
* prettier
* Fixup some imports
* Rely on serviceName instead of error code
* lint fix
* Remove invalid null arguments
* Take suggested log changes
* Publish (the alpha.10 versions listed above)
* empty commit
* Publish (the alpha.11 versions listed above)
* anotha one
* Publish (the alpha.12 versions listed above)

Apollo-Orig-Commit-AS: apollographql/apollo-server@d8ade4d