[BE-308] Federated metrics support #2900
Conversation
(force-pushed from 13bcd5c to 6096b14)
With a couple comments, the second commit looks good to me!
this.stopped = true;
return {
  durationNs: durationHrTimeToNanos(process.hrtime(this.startHrTime)),
We've recently started taking great care not to use Node.js-specific APIs, and process.hrtime certainly falls into that category, which (I suspect among other things, like its dependence on protobuf) makes apollo-engine-reporting one of those packages that has to be conditionally require'd depending on whether the environment supports it.

I don't think we should do anything right now; I'm just pointing out that it might be worth looking into the performance-now package, which falls back (CoffeeScript!) to performance.now (browsers) and Date.now().
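For context, here is a minimal sketch of the kind of fallback chain performance-now implements; the helper below is purely illustrative and not part of this PR:

// Illustrative only: prefer process.hrtime, then performance.now, then Date.now.
type NowMs = () => number; // milliseconds since an arbitrary origin

function chooseTimer(): NowMs {
  const g = globalThis as any;
  if (g.process && typeof g.process.hrtime === 'function') {
    // Node.js: convert the [seconds, nanoseconds] tuple to fractional milliseconds.
    return () => {
      const [s, ns] = g.process.hrtime();
      return s * 1e3 + ns / 1e6;
    };
  }
  if (g.performance && typeof g.performance.now === 'function') {
    // Browsers (and newer Node): monotonic, but only a double with roughly microsecond precision.
    return () => g.performance.now();
  }
  // Last resort: wall-clock milliseconds, subject to clock adjustments.
  return () => Date.now();
}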
This is a good point and I'll add it as a comment. It's not a regression: we've always used process.hrtime for every JS implementation of timing, and I think it would be bad for people to use Date.now() for timing since it lacks appropriate precision (and clock skew control).
I am underwhelmed by both performance.now() and performance-now for this purpose.

I think we definitely don't need to care about supporting engine reporting from browsers. But we may need to care about alternate non-Node server implementations. I can't recall anyone ever complaining that engine didn't work on their server due to lack of process.hrtime, though?

performance.now uses a double instead of a pair of integers, so it has less precision, and even the docs basically say to only expect microsecond precision. (And browser implementations explicitly fudge the numbers to work around Spectre issues.)

performance-now has https://github.com/myrne/performance-now/blob/614099d729aef5d7bdbb85c0d1060166d63d05c7/src/performance-now.coffee#L6-L8 which, well: the entire point of process.hrtime returning a silly pair of integers instead of a single number is that you can't precisely represent nanoseconds in a JS number! See my comment on durationHrTimeToNanos: after a process uptime of 104 days, this function will start losing precision.
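To spell out that figure: a JS number only represents integers exactly up to Number.MAX_SAFE_INTEGER (2^53 - 1, about 9.0e15), and 9.0e15 nanoseconds is roughly 104 days. A sketch of the conversion being discussed (the PR's durationHrTimeToNanos has this general shape, though the exact code may differ):

// Sketch of converting an hrtime tuple to nanoseconds; the PR's helper is
// similar in spirit but may differ in detail.
function durationHrTimeToNanos(hrtime: [number, number]): number {
  // hrtime[0] = whole seconds, hrtime[1] = nanosecond remainder (< 1e9).
  return hrtime[0] * 1e9 + hrtime[1];
}

// Why ~104 days: Number.MAX_SAFE_INTEGER = 2**53 - 1 ≈ 9.007e15 ns,
// and 9.007e15 / 1e9 / 86400 ≈ 104.2 days. Beyond that, distinct nanosecond
// counts start collapsing onto the same double.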
In other words, I'd be way more interested in a function with the API of process.hrtime that tries to produce integer nanoseconds from performance.now than the other way around.
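A sketch of that shape, purely illustrative: a function with process.hrtime's [seconds, nanoseconds] API backed by performance.now (assuming a global performance, i.e. browsers or recent Node):

// Illustrative only: an hrtime-shaped wrapper over performance.now.
function hrtimeFromPerformanceNow(previous?: [number, number]): [number, number] {
  const ms = performance.now(); // a double, so only ~microsecond precision at best
  let seconds = Math.floor(ms / 1e3);
  let nanos = Math.round((ms % 1e3) * 1e6);
  if (nanos >= 1e9) {
    seconds += 1;
    nanos -= 1e9;
  }
  if (previous) {
    // Mirror process.hrtime(previous): return the delta since the earlier reading.
    seconds -= previous[0];
    nanos -= previous[1];
    if (nanos < 0) {
      seconds -= 1;
      nanos += 1e9;
    }
  }
  return [seconds, nanos];
}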
// of GraphQLError are not enumerable and won't show up in the trace
// (even in the json field) anyway.)
// XXX I really don't understand why we don't allow rewriteError
// to rewrite extensions, which do show up in JSON.
Not rewriting extensions seems like an oversight!
Fixed in #2932
(force-pushed from 6096b14 to df4c52a)
(force-pushed from 40bd29e to 2276497)
(force-pushed from ec3a614 to aad8e2f)
Surprise! This is now the federated metrics PR.
uint64 sent_time_offset = 4;

// Wallclock times measured in the gateway for when this operation was
// sent and recieved.
oops typo
}
message FetchNode {
  string serviceName = 1;
  string operation = 2;
Let's get rid of this until we build UI that needs it. In general we probably shouldn't collect any string data that we don't immediately present to users, to help them form a better model of what data we collect.
Add a comment noting that we might want to augment it with more information about the operation later, when we build a feature that needs it.
// FIXME: do we need requires?

// This Trace only contains start_time, end_time, duration_ns, and root;
// all timings were calculated on the federated service.
Be explicit somewhere, as an overall statement, that all fields on FetchNode are measured on the gateway.
@@ -51,11 +52,14 @@ export interface GraphQLResponse {
}

export interface GraphQLRequestMetrics {
could use comments
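For instance, something along these lines (the field names below are hypothetical, purely to illustrate the kind of doc comments being asked for; the interface in this PR may have different members):

// Hypothetical illustration only; not the actual interface from this PR.
export interface GraphQLRequestMetricsExample {
  // True when a trace for this request should be captured and sent to Engine.
  captureTraces?: boolean;
  // True when the operation was served from the persisted query cache.
  persistedQueryHit?: boolean;
  // True when the client registered a persisted query on this request.
  persistedQueryRegister?: boolean;
  // Filled in by the gateway with its protobuf query plan trace so the
  // reporting agent can attach it to the outgoing report.
  queryPlanTrace?: unknown;
}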
public didEncounterErrors(errors: GraphQLError[]) {
  errors.forEach(err => {
    // This is an error from a federated service. We will already be reporting
I suppose if we supported nested gateway levels (which won't work now for reporting anyway) then maybe this is wrong. Maybe it should be moved to EngineReportingAgent instead of the tree builder, before the call to the tree builder.
(force-pushed from a9d4ece to 281ad6e)
// in its response.
if (traceNode) {
  http = {
    headers: new Headers({ 'apollo-federation-include-trace': 'ftv1' }),
It seems like we're only sending these to the underlying service, but I was pretty sure there was a way to hook into the request context from the gateway to persist incoming headers and pass them along 🤔
@abernix or @jbaxleyiii, any idea if I'm missing something here? Maybe the header injection into the service happens elsewhere.
I think the hooking in happens at the GraphQLTransport level, so we might have some more work to do here. But I think this means that if users do that, they'll lose metrics tracking rather than losing their transport changes, which we should definitely resolve, but it's terrifying.
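To make the concern concrete, here is a rough sketch (not the actual gateway code) of merging the tracing header into whatever headers a user's transport hook has already set, rather than replacing them; it assumes the fetch-style Headers used by apollo-server-env:

// Illustrative only: merge the federated-tracing header into existing headers
// instead of overwriting them, so user-supplied headers survive.
function withFederatedTracingHeader(existing?: Headers): Headers {
  const headers = new Headers(existing);
  headers.set('apollo-federation-include-trace', 'ftv1');
  return headers;
}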
(force-pushed from 2aa7b23 to 14e96cd)
Augments apollo-engine-reporting and @apollo/gateway to provide federated metrics to Engine from the gateway.

Extend trace reports to contain a QueryPlan tree when traces are received from a gateway, mimicking the query plan that the gateway makes internally. FetchNodes on the tree correspond to GraphQL operations executed against federated services; each FetchNode contains its own tree of Trace.Nodes.

- apollo-engine-reporting
  - Factor out core tree building logic to EngineReportingTreeBuilder class
  - New EngineFederatedTracingExtension adds a base64-encoded protobuf of a Trace (with just the node tree and top level timestamps) to an extension on federated service responses (if requested to do so via an HTTP header from the gateway; sketched below).
  - Implement new originalFieldName trace field (similar goal to #2401, BE-158).
  - If @apollo/gateway built us a query plan trace, add it to the trace
- @apollo/gateway
  - Builds up a protobuf representation of the query plan, including sub-traces from federated services on FetchNodes
- apollo-engine-reporting-protobuf
  - Copy reports.proto from internal repo
  - Set editorconfig settings to match internal repo
  - Always use normal numbers, not long.js objects
- apollo-server-core
  - Thread some data through between reporting and gateway modules.
- apollo-server-integration-testsuite
  - Clean up some ts warnings

TODO record __resolveReference as multiple nodes?
TODO is it actually OK to include the operation text in fetch trace nodes?
TODO more tests, especially of error cases
TODO internal docs (eg FIXME comment in reports.proto)
TODO external docs
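A rough sketch of the ftv1 round trip described above, using the Trace message exported by apollo-engine-reporting-protobuf (the helper names are illustrative, not the PR's actual functions):

import { Trace } from 'apollo-engine-reporting-protobuf';

// Federated service side (sketch): serialize the request's Trace and expose
// its base64 encoding under the "ftv1" key of the response extensions.
function encodeFtv1(trace: Trace): string {
  return Buffer.from(Trace.encode(trace).finish()).toString('base64');
}

// Gateway side (sketch): decode the extension from a fetch response back into
// a Trace so it can be grafted onto the corresponding FetchNode of the query plan.
function decodeFtv1(ftv1: string): Trace {
  return Trace.decode(Buffer.from(ftv1, 'base64'));
}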
Previously, the query plan execution logic would throw an error and fail the entire operation if it could not parse a trace. Now it records the fact that it failed to parse the trace on the overall fetch node itself. This commit also adds some comments.
Build still breaking :(
We used to rely on the error code matching "DOWNSTREAM_SERVICE_ERROR" to determine that an error came from an underlying service rather than from the gateway itself. In the past few weeks, error handling in the gateway has been modified to allow this error code to be overridden, but we are guaranteed to always have a serviceName if the error comes from a downstream service. Note, however, that this means that if users pass "serviceName" in an error extension, it will NOT be reported for metrics, even if it occurred at the gateway level. We may want to find something less flimsy than this.
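A sketch of the check this describes (the exact place serviceName lives on the error in this PR may differ):

import { GraphQLError } from 'graphql';

// Sketch: treat an error as coming from a downstream (federated) service,
// and therefore already counted in that service's own ftv1 trace, when the
// gateway has attached a serviceName to it.
function errorIsFromFederatedService(error: GraphQLError): boolean {
  const extensions = error.extensions as Record<string, any> | undefined;
  return typeof (extensions && extensions.serviceName) === 'string';
}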
(force-pushed from ff3863a to 303d690)
- apollo-cache-control@0.7.6-alpha.10
- apollo-datasource-rest@0.5.2-alpha.1
- apollo-engine-reporting-protobuf@0.3.2-alpha.0
- apollo-engine-reporting@1.4.0-alpha.10
- @apollo/federation@0.6.11-alpha.10
- @apollo/gateway@0.7.0-alpha.10
- apollo-server-azure-functions@2.7.0-alpha.10
- apollo-server-cloud-functions@2.7.0-alpha.10
- apollo-server-cloudflare@2.7.0-alpha.10
- apollo-server-core@2.7.0-alpha.10
- apollo-server-express@2.7.0-alpha.10
- apollo-server-fastify@2.7.0-alpha.10
- apollo-server-hapi@2.7.0-alpha.10
- apollo-server-integration-testsuite@2.7.0-alpha.10
- apollo-server-koa@2.7.0-alpha.10
- apollo-server-lambda@2.7.0-alpha.10
- apollo-server-micro@2.7.0-alpha.10
- apollo-server-plugin-base@0.6.0-alpha.10
- apollo-server-plugin-response-cache@0.2.7-alpha.10
- apollo-server-testing@2.7.0-alpha.10
- apollo-server-types@0.1.1-alpha.1
- apollo-server@2.7.0-alpha.10
- apollo-tracing@0.7.5-alpha.10
- graphql-extensions@0.8.0-alpha.10
- apollo-server-azure-functions@2.7.0-alpha.11
- apollo-server-cloud-functions@2.7.0-alpha.11
- apollo-server-cloudflare@2.7.0-alpha.11
- apollo-server-express@2.7.0-alpha.11
- apollo-server-fastify@2.7.0-alpha.11
- apollo-server-hapi@2.7.0-alpha.11
- apollo-server-integration-testsuite@2.7.0-alpha.11
- apollo-server-koa@2.7.0-alpha.11
- apollo-server-lambda@2.7.0-alpha.11
- apollo-server-micro@2.7.0-alpha.11
- apollo-server-testing@2.7.0-alpha.11
- apollo-server@2.7.0-alpha.11
- apollo-server-azure-functions@2.7.0-alpha.12
- apollo-server-cloud-functions@2.7.0-alpha.12
- apollo-server-cloudflare@2.7.0-alpha.12
- apollo-server-express@2.7.0-alpha.12
- apollo-server-fastify@2.7.0-alpha.12
- apollo-server-hapi@2.7.0-alpha.12
- apollo-server-integration-testsuite@2.7.0-alpha.12
- apollo-server-koa@2.7.0-alpha.12
- apollo-server-lambda@2.7.0-alpha.12
- apollo-server-micro@2.7.0-alpha.12
- apollo-server-testing@2.7.0-alpha.12
- apollo-server@2.7.0-alpha.12
A previous commit in #2900 migrated the extension for reporting tracing information to Engine into an abstraction called the treeBuilder. The treeBuilder is now used by both a federated extension and the normal extension, but the two have different end-timing semantics: in the normal extension, the end timing is captured at the end of the *request*, whereas in the federated extension it is captured at the end of *execution*. The reason is that the standard extension does not actually append information to the response's extensions field, whereas the federated extension does, so serialization and transmission of the response cannot be included in the federated extension's timing.

Due to the limitations of the current requestPipeline, executionDidEnd is the hook that must be used to halt the overall timing, and it is called directly before didEncounterErrors. It is likely possible to refactor the request pipeline to capture the timing of serializing errors into the proto state, but since that would add more boolean control flow to the overall request context, this commit simply discards that timing information for reporting instead of throwing an error. A subsequent commit will follow up by replacing error throws with logging statements.
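A sketch of that difference in hook terms (hook names loosely follow the graphql-extensions API of the time; the PR's actual wiring may differ):

// Illustrative only: where each extension stops the shared tree builder's
// overall timer.

interface TreeBuilderLike {
  stopTiming(): void;
}

class EngineReportingExtensionSketch {
  constructor(private treeBuilder: TreeBuilderLike) {}
  // The full reporting extension never writes into the response's extensions
  // field, so it can wait until the response is about to go out.
  public willSendResponse() {
    this.treeBuilder.stopTiming();
  }
}

class FederatedTracingExtensionSketch {
  constructor(private treeBuilder: TreeBuilderLike) {}
  // The federated extension must serialize its trace *into* the response's
  // extensions field, so the latest it can stop is the end of execution.
  public executionDidStart(): () => void {
    return () => this.treeBuilder.stopTiming(); // the executionDidEnd callback
  }
}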
* [BE-308] Federated metrics support
* Update protobuf to match cloud repo
* No longer set nonexistent operation field
* Update proto to fix typo
* Convert trace failure into metric
* add snapshot-serializer test
* Add to comment justifying console.warn
* update snapshot
* prettier
* Fixup some imports
* Rely on serviceName instead of error code
* lint fix
* Remove invalid null arguments
* Take suggested log changes
* Publish (the alpha.10 versions listed above)
* empty commit
* Publish (the alpha.11 versions listed above)
* anotha one
* Publish (the alpha.12 versions listed above)

Apollo-Orig-Commit-AS: apollographql/apollo-server@d8ade4d