Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend Studio usage reporting to support referenced fields in addition to executed fields #5708

Closed
hwillson opened this issue Sep 14, 2021 · 6 comments · Fixed by #5956
Closed
Assignees
Milestone

Comments

@hwillson
Copy link
Member

hwillson commented Sep 14, 2021

In Apollo Studio, the "Fields" page lets you see how often your fields are executed by operations — ie, how often their resolvers run. It additionally lets you see which clients and operations ran those operations.

However, knowing if an operation executed a field doesn't tell the whole story about the relationship between operations and fields. You might also be curious to learn if a field was textually referenced in the operation itself.

It's possible for a field to be referenced without executing. In fact, there are many reasons this can happen:

  • The field is nested under another field which evaluates to null
  • The field is nested under another field which evaluates to an empty list
  • The field is nested under a non-matching fragment
  • The field is nested under @include or @skip

If you're using the Fields page to determine "what are all the operations that use this field", you probably want to know about these usages too!

It's also possible for a field to be executed without being referenced. That's because when we track "executed fields", we are always tracking the concrete object type that is being executed, not an interface type it may have been resolved through. So with schema:

interface Animal {
  legs: Int
}
type Dog implements Animal {
  legs: Int
}
type Query {
  myFavoriteAnimal: Animal  # In practice, always returns a Dog
}

the operation { myFavoriteAnimal { legs } } "references" Animal.legs, not Dog.legs, but it is the field Dog.legs that is executed.

We aim to improve Apollo Studio and the apollo-server usage reporting so that the Fields page displays both the number of executions and the number of textual references.

In addition to this just being an interesting thing to ask about, this will also mean that the Fields page can be useful without needing full execution tracing. There is a real performance impact of full execution tracing, especially in the federation context where the ftv1 protocol involves sending subgraph traces to the gateway/router in-band in the response. And not every GraphQL server supports full execution tracing in the first place. If we support referenced fields on the Fields page, then you can run a gateway/router in front of an arbitrary subgraph without federated tracing (whether you want to do this for performance or lack of implementation reasons) and still get some useful data on the Fields page. (You can then perhaps run full tracing on a sampled subset of queries to get reasonable approximate execution data too.)

@hwillson hwillson added this to the Reporting milestone Sep 14, 2021
@glasser
Copy link
Member

glasser commented Sep 21, 2021

Just as an external status update: this project had two halves. One is an internal refactoring to improve how related data is stored in our backend databases, and that has now been implemented. The next step is to make the Apollo Server related changes to report field usage differently. This won't happen until October.

@glasser glasser added 2021-10 and removed 2021-09 labels Sep 21, 2021
@hwillson hwillson added the size/large Estimated to take MORE THAN A WEEK label Oct 1, 2021
@RichardWright
Copy link

If you could add a diagnostic channel (https://nodejs.org/api/diagnostics_channel.html) for graphql render times, that would be fantastic.

@glasser
Copy link
Member

glasser commented Oct 19, 2021

@RichardWright That sounds like something you could build with the plugin API? Seems pretty separate from Studio usage reporting.

@hwillson hwillson added 2021-11 and removed 2021-10 labels Nov 1, 2021
@glasser glasser changed the title Usage reporting performance improvements Extend Studio usage reporting to support referenced fields in addition to executed fields Nov 10, 2021
@alexandernanberg
Copy link

Will you be able to use this without Apollo Studio? E.g. if you would like to use a custom server solution?

@glasser
Copy link
Member

glasser commented Dec 1, 2021

This is an improvement to a feature of Apollo Studio. You certainly could use the open source usage reporting plugin as a basis for reporting to something else if you'd like, but that's not a goal of this project.

@glasser glasser added 2021-12 and removed 2021-11 labels Dec 10, 2021
glasser added a commit that referenced this issue Dec 17, 2021
In Apollo Studio, the "Fields" page lets you see how often your fields
are *executed* by operations --- ie, how often their resolvers run. It
additionally lets you see which clients and operations ran those
operations.

However, knowing if an operation *executed* a field doesn't tell the
whole story about the relationship between operations and fields.  You
might also be curious to learn if a field was textually referenced in
the operation itself.

It's possible for a field to be referenced without executing. In fact,
there are many reasons this can happen:
- The field is nested under another field which evaluates to null
- The field is nested under another field which evaluates to an empty
  list
- The field is nested under a non-matching fragment
- The field is nested under `@include` or `@skip`

If you're using the Fields page to determine "what are all the
operations that use this field", you probably want to know about these
usages too!

It's also possible for a field to be executed without being referenced.
That's because when we track "executed fields", we are always tracking
the concrete object type that is being executed, not an interface type
it may have been resolved through.  So with schema:

```graphql
interface Animal {
  legs: Int
}
type Dog implements Animal {
  legs: Int
}
type Query {
  myFavoriteAnimal: Animal  # In practice, always returns a Dog
}
```

the operation `{ myFavoriteAnimal { legs } }` "references"
`Animal.legs`, not `Dog.legs`, but it is the field `Dog.legs` that is
executed.

(Additionally, when using federation, fields can be executed without
being referenced in the operation if the query plan requires them to be
executed to fulfill an `@requires` or `@key` directive.)

This PR extends Apollo Server's usage reporting plugin to provide a list
of "referenced fields" along with every operation. Note that these
fields depend only on the operation, not on the variables passed or
anything at execution time, so we don't need one set per trace, just one
set per operation.

In addition to the fact that this statistic is an interesting one, this
will also mean that the Fields page can be useful without needing full
execution tracing. There is a real performance impact of full execution
tracing, especially in the federation context where the ftv1 protocol
involves sending subgraph traces to the gateway/router in-band in the
response.  And not every GraphQL server supports full execution tracing
in the first place.  With support for referenced fields on the Fields
page, you will be able to run a gateway/router in front of an arbitrary
subgraph without federated tracing (whether you want to do this for
performance or lack of implementation reasons) and still get some useful
data on the Fields page.  (You can then perhaps run full tracing on a
sampled subset of queries to get reasonable approximate execution data
too.)

Note that the new Studio functionality may not be in general
availability when this PR merges into the Apollo Server release branch
or when the first alphas are created; it will be fully enabled
(including docs) by the time this PR is released in a non-prerelease.

As part of the implementation, we extend the existing "signature cache"
to cache the referenced field list as well. Note that the signature is a
pure function of the operation, whereas the referenced field list also
depends on the schema, so we add a little mechanism to throw out the
cache if the schema changes.

Fixes #5708.
glasser added a commit that referenced this issue Dec 22, 2021
The usage reporting plugin already has an includeRequest option which
allows you to tell the plugin to completely ignore an operation.

One reason you may have wanted to use this is to avoid the overhead of
capturing a full field-by-field execution trace. But in this case,
includeRequest=false is overkill: it removes the operation from aspects
of Studio such as the Operations page and schema checks which don't
require full execution traces to operate.

This PR adds a new option, captureTraces, which is invoked after
includeRequest returns true.  If you return false from this operation,
you won't incur the overhead of capturing a detailed trace (either
directly in this process, or in subgraphs if this process is a Gateway).

Most of Studio's features (including the newly-added "referencing
operations" column on the Fields page) will still reflect the existence
of this operation. As of the end of 2021, the features that this
operation will not contribute to are:

- The "field executions" column on the Fields page
- The per-field timing hints shown in Explorer and in vscode-graphql
- The trace view on the Operations page

You can also pass a number between 0 or 1, which will be interpreted as
randomly capturing traces that percent of a time. (Specifying this as a
number rather than a function that invokes Math.random means that a
future version of Apollo Server and Studio could include this number in
the usage report and use it to let you view an estimate for total field
executions based on using the number as a scale; this is however not yet
implemented.)

Since this may provide a common use case for running a GraphQL server
without any need for per-field instrumentation, the request pipeline now
only instruments the schema if it ever sees a willResolveField callback.
This *almost* means that if you're running a monolithic server with
captureTraces always returning false (or usage reporting disabled), or
if you're running a subgraph whose gateway has captureTraces:false (and
thus never receives a request with the `apollo-federation-include-trace`
header), then execution won't have the small performance impact of
instrumentation. In practice you need to also disable the cache control
plugin to get this speedup, as it is installed by default and uses
willResolveField to implement dynamic cache control. If this
optimization proves to be important we can provide a mode of the cache
control plugin that doesn't allow for dynamic cache control.
(Alternatively we may eventually find a way to support the
instrumentation of GraphQL execution with lower overhead.)

Part of #5708.
glasser added a commit that referenced this issue Dec 23, 2021
The usage reporting plugin already has an includeRequest option which
allows you to tell the plugin to completely ignore an operation.

One reason you may have wanted to use this is to avoid the overhead of
capturing a full field-by-field execution trace. But in this case,
includeRequest=false is overkill: it removes the operation from aspects
of Studio such as the Operations page and schema checks which don't
require full execution traces to operate.

This PR adds a new option, captureTraces, which is invoked after
includeRequest returns true.  If you return false from this operation,
you won't incur the overhead of capturing a detailed trace (either
directly in this process, or in subgraphs if this process is a Gateway).

Most of Studio's features (including the newly-added "referencing
operations" column on the Fields page) will still reflect the existence
of this operation. As of the end of 2021, the features that this
operation will not contribute to are:

- The "field executions" column on the Fields page
- The per-field timing hints shown in Explorer and in vscode-graphql
- The trace view on the Operations page

You can also pass a number between 0 or 1, which will be interpreted as
randomly capturing traces that percent of a time. (Specifying this as a
number rather than a function that invokes Math.random means that a
future version of Apollo Server and Studio could include this number in
the usage report and use it to let you view an estimate for total field
executions based on using the number as a scale; this is however not yet
implemented.)

Since this may provide a common use case for running a GraphQL server
without any need for per-field instrumentation, the request pipeline now
only instruments the schema if it ever sees a willResolveField callback.
This *almost* means that if you're running a monolithic server with
captureTraces always returning false (or usage reporting disabled), or
if you're running a subgraph whose gateway has captureTraces:false (and
thus never receives a request with the `apollo-federation-include-trace`
header), then execution won't have the small performance impact of
instrumentation. In practice you need to also disable the cache control
plugin to get this speedup, as it is installed by default and uses
willResolveField to implement dynamic cache control. If this
optimization proves to be important we can provide a mode of the cache
control plugin that doesn't allow for dynamic cache control.
(Alternatively we may eventually find a way to support the
instrumentation of GraphQL execution with lower overhead.)

Part of #5708.
glasser added a commit that referenced this issue Dec 23, 2021
The usage reporting plugin already has an `includeRequest` option which
allows you to tell the plugin to completely ignore an operation.

One reason you may have wanted to use this is to avoid the overhead of
capturing a full field-by-field execution trace. But in this case,
`includeRequest: false` is overkill: it removes the operation from aspects
of Studio such as the Operations page and schema checks which don't
require full execution traces to operate.

This PR adds a new option, `fieldLevelInstrumentation`, which is invoked after
includeRequest returns true.  If you return false from this operation, you won't
incur the overhead of capturing a detailed trace (either directly in this
process, or in subgraphs if this process is a Gateway).

Most of Studio's features (including the newly-added "referencing
operations" column on the Fields page) will still reflect the existence
of this operation. As of the end of 2021, the features that this
operation will not contribute to are:

- The "field executions" column on the Fields page
- The per-field timing hints shown in Explorer and in vscode-graphql
- The trace view on the Operations page

Apollo Server now sends both an "observed execution count" and "estimated
execution count" for each field (for each operation and client). The former is
literally how many times we saw the field get executed (only during operations
where `fieldLevelInstrumentation` returned truthy). If the hook only returns
true or false, the latter is the same, but you may also return a number between
0 and 1 which represents the sample rate for that operation; the "estimated
execution count" will be incremented by the reciprocal of that number instead of
by 1.

So for example, with:

    fieldLevelInstrumentation: () => Math.random() < 0.01 ? 0.01 : false

Apollo Server will instrument 1% of operations, and the "estimated execution
count" will be 100 times more than the observed execution count. (You can
imagine more sophisticated implementations of `fieldLevelInstrumentation` which
sample more common operations more aggressively than rare operations.)

If you pass a number for `fieldLevelInstrumentation`, it is equivalent to
passing a function of the form in the above example; that is, the previous
example behaves identically to `fieldLevelInstrumentation: 0.01`.

The `latency_count` sent with field stats (which powers per-field timing hints)
is now always scaled by the given fact (ie, there's no separate "observed
histogram").

Since this may provide a common use case for running a GraphQL server without
any need for per-field instrumentation, the request pipeline now only
instruments the schema if it ever sees a willResolveField callback. This
*almost* means that if you're running a monolithic server with
fieldLevelInstrumentation always returning false (or usage reporting disabled),
or if you're running a subgraph whose gateway has
fieldLevelInstrumentation:false (and thus never receives a request with the
`apollo-federation-include-trace` header), then execution won't have the small
performance impact of instrumentation. In practice you need to also disable the
cache control plugin to get this speedup, as it is installed by default and uses
willResolveField to implement dynamic cache control. If this optimization proves
to be important we can provide a mode of the cache control plugin that doesn't
allow for dynamic cache control. (Alternatively we may eventually find a way to
support the instrumentation of GraphQL execution with lower overhead.)

Part of #5708.
glasser added a commit that referenced this issue Dec 23, 2021
The usage reporting plugin already has an `includeRequest` option which
allows you to tell the plugin to completely ignore an operation.

One reason you may have wanted to use this is to avoid the overhead of
capturing a full field-by-field execution trace. But in this case,
`includeRequest: false` is overkill: it removes the operation from aspects
of Studio such as the Operations page and schema checks which don't
require full execution traces to operate.

This PR adds a new option, `fieldLevelInstrumentation`, which is invoked after
includeRequest returns true.  If you return false from this operation, you won't
incur the overhead of capturing a detailed trace (either directly in this
process, or in subgraphs if this process is a Gateway).

Most of Studio's features (including the newly-added "referencing
operations" column on the Fields page) will still reflect the existence
of this operation. As of the end of 2021, the features that this
operation will not contribute to are:

- The "field executions" column on the Fields page
- The per-field timing hints shown in Explorer and in vscode-graphql
- The trace view on the Operations page

Apollo Server now sends both an "observed execution count" and "estimated
execution count" for each field (for each operation and client). The former is
literally how many times we saw the field get executed (only during operations
where `fieldLevelInstrumentation` returned truthy). If the hook only returns
true or false, the latter is the same, but you may also return a number between
0 and 1 which represents the sample rate for that operation; the "estimated
execution count" will be incremented by the reciprocal of that number instead of
by 1.

So for example, with:

    fieldLevelInstrumentation: () => Math.random() < 0.01 ? 0.01 : false

Apollo Server will instrument 1% of operations, and the "estimated execution
count" will be 100 times more than the observed execution count. (You can
imagine more sophisticated implementations of `fieldLevelInstrumentation` which
sample more common operations more aggressively than rare operations.)

If you pass a number for `fieldLevelInstrumentation`, it is equivalent to
passing a function of the form in the above example; that is, the previous
example behaves identically to `fieldLevelInstrumentation: 0.01`.

The `latency_count` sent with field stats (which powers per-field timing hints)
is now always scaled by the given fact (ie, there's no separate "observed
histogram").

Note that the semantics of the `requestContext.metrics.captureTraces` field
changes with this PR. Previously its value matched the value returned by the
`includeRequest` hook; now it matches the truthiness of the value returned by
the `fieldLevelInstrumentation` hook. This field determines whether Apollo
Gateway includes the `apollo-federation-include-trace` header with outgoing
requests so this is an appropriate change.

Since this may provide a common use case for running a GraphQL server without
any need for per-field instrumentation, the request pipeline now only
instruments the schema if it ever sees a willResolveField callback. This
*almost* means that if you're running a monolithic server with
fieldLevelInstrumentation always returning false (or usage reporting disabled),
or if you're running a subgraph whose gateway has
fieldLevelInstrumentation:false (and thus never receives a request with the
`apollo-federation-include-trace` header), then execution won't have the small
performance impact of instrumentation. In practice you need to also disable the
cache control plugin to get this speedup, as it is installed by default and uses
willResolveField to implement dynamic cache control. If this optimization proves
to be important we can provide a mode of the cache control plugin that doesn't
allow for dynamic cache control. (Alternatively we may eventually find a way to
support the instrumentation of GraphQL execution with lower overhead.)

Part of #5708.
glasser added a commit that referenced this issue Dec 23, 2021
The usage reporting plugin already has an `includeRequest` option which
allows you to tell the plugin to completely ignore an operation.

One reason you may have wanted to use this is to avoid the overhead of
capturing a full field-by-field execution trace. But in this case,
`includeRequest: false` is overkill: it removes the operation from aspects
of Studio such as the Operations page and schema checks which don't
require full execution traces to operate.

This PR adds a new option, `fieldLevelInstrumentation`, which is invoked after
includeRequest returns true.  If you return false from this operation, you won't
incur the overhead of capturing a detailed trace (either directly in this
process, or in subgraphs if this process is a Gateway).

Most of Studio's features (including the newly-added "referencing
operations" column on the Fields page) will still reflect the existence
of this operation. As of the end of 2021, the features that this
operation will not contribute to are:

- The "field executions" column on the Fields page
- The per-field timing hints shown in Explorer and in vscode-graphql
- The trace view on the Operations page

Apollo Server now sends both an "observed execution count" and "estimated
execution count" for each field (for each operation and client). The former is
literally how many times we saw the field get executed (only during operations
where `fieldLevelInstrumentation` returned truthy). If the hook only returns
true or false, the latter is the same, but you may also return a number between
0 and 1 which represents the sample rate for that operation; the "estimated
execution count" will be incremented by the reciprocal of that number instead of
by 1.

So for example, with:

    fieldLevelInstrumentation: () => Math.random() < 0.01 ? 0.01 : false

Apollo Server will instrument 1% of operations, and the "estimated execution
count" will be 100 times more than the observed execution count. (You can
imagine more sophisticated implementations of `fieldLevelInstrumentation` which
sample more common operations more aggressively than rare operations.)

If you pass a number for `fieldLevelInstrumentation`, it is equivalent to
passing a function of the form in the above example; that is, the previous
example behaves identically to `fieldLevelInstrumentation: 0.01`.

The `latency_count` sent with field stats (which powers per-field timing hints)
is now always scaled by the given fact (ie, there's no separate "observed
histogram").

Note that the semantics of the `requestContext.metrics.captureTraces` field
changes with this PR. Previously its value matched the value returned by the
`includeRequest` hook; now it matches the truthiness of the value returned by
the `fieldLevelInstrumentation` hook. This field determines whether Apollo
Gateway includes the `apollo-federation-include-trace` header with outgoing
requests so this is an appropriate change.

Since this may provide a common use case for running a GraphQL server without
any need for per-field instrumentation, the request pipeline now only
instruments the schema if it ever sees a willResolveField callback. This
*almost* means that if you're running a monolithic server with
fieldLevelInstrumentation always returning false (or usage reporting disabled),
or if you're running a subgraph whose gateway has
fieldLevelInstrumentation:false (and thus never receives a request with the
`apollo-federation-include-trace` header), then execution won't have the small
performance impact of instrumentation. In practice you need to also disable the
cache control plugin to get this speedup, as it is installed by default and uses
willResolveField to implement dynamic cache control. If this optimization proves
to be important we can provide a mode of the cache control plugin that doesn't
allow for dynamic cache control. (Alternatively we may eventually find a way to
support the instrumentation of GraphQL execution with lower overhead.)

Part of #5708.
glasser added a commit that referenced this issue Dec 28, 2021
The usage reporting plugin already has an `includeRequest` option which
allows you to tell the plugin to completely ignore an operation.

One reason you may have wanted to use this is to avoid the overhead of
capturing a full field-by-field execution trace. But in this case,
`includeRequest: false` is overkill: it removes the operation from aspects
of Studio such as the Operations page and schema checks which don't
require full execution traces to operate.

This PR adds a new option, `fieldLevelInstrumentation`, which is invoked after
includeRequest returns true.  If you return false from this operation, you won't
incur the overhead of capturing a detailed trace (either directly in this
process, or in subgraphs if this process is a Gateway).

Most of Studio's features (including the newly-added "referencing
operations" column on the Fields page) will still reflect the existence
of this operation. As of the end of 2021, the features that this
operation will not contribute to are:

- The "field executions" column on the Fields page
- The per-field timing hints shown in Explorer and in vscode-graphql
- The trace view on the Operations page

Apollo Server now sends both an "observed execution count" and "estimated
execution count" for each field (for each operation and client). The former is
literally how many times we saw the field get executed (only during operations
where `fieldLevelInstrumentation` returned truthy). If the hook only returns
true or false, the latter is the same, but you may also return a number (typically
either 0 or at least 1) which represents a weight for that operation; the "estimated
execution count" will be incremented by that number instead of by 1.

So for example, with:

    fieldLevelInstrumentation: () => Math.random() < 0.01 ? 1/0.01 : false

Apollo Server will instrument 1% of operations, and the "estimated execution
count" will be 100 times more than the observed execution count. (You can
imagine more sophisticated implementations of `fieldLevelInstrumentation` which
sample more common operations more aggressively than rare operations.)

If you pass a number for `fieldLevelInstrumentation`, it is equivalent to
passing a function of the form in the above example; that is, the previous
example behaves identically to `fieldLevelInstrumentation: 0.01`.

The `latency_count` sent with field stats (which powers per-field timing hints)
is now always scaled by the given fact (ie, there's no separate "observed
histogram").

Note that the semantics of the `requestContext.metrics.captureTraces` field
changes with this PR. Previously its value matched the value returned by the
`includeRequest` hook; now it matches the truthiness of the value returned by
the `fieldLevelInstrumentation` hook. This field determines whether Apollo
Gateway includes the `apollo-federation-include-trace` header with outgoing
requests so this is an appropriate change.

Since this may provide a common use case for running a GraphQL server without
any need for per-field instrumentation, the request pipeline now only
instruments the schema if it ever sees a willResolveField callback. This
*almost* means that if you're running a monolithic server with
fieldLevelInstrumentation always returning false (or usage reporting disabled),
or if you're running a subgraph whose gateway has
fieldLevelInstrumentation:false (and thus never receives a request with the
`apollo-federation-include-trace` header), then execution won't have the small
performance impact of instrumentation. In practice you need to also disable the
cache control plugin to get this speedup, as it is installed by default and uses
willResolveField to implement dynamic cache control. If this optimization proves
to be important we can provide a mode of the cache control plugin that doesn't
allow for dynamic cache control. (Alternatively we may eventually find a way to
support the instrumentation of GraphQL execution with lower overhead.)

Part of #5708.
@glasser
Copy link
Member

glasser commented Dec 30, 2021

This is released in Apollo Server 3.6.0.

@hwillson hwillson removed size/large Estimated to take MORE THAN A WEEK 2021-12 labels Jun 24, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 20, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants