Gateway behaves poorly when the first updateCompositionConfig call throws #335

glasser · 2020-07-31T18:32:04Z

If the initial call to updateCompositionConfig in ApolloGateway.load() throws (eg, network error fetching managed federation data), the system behaves poorly.

We never set up the federation service polling loop, so the situation will never recover. apollographql/apollo-server#4277 was supposed to fix this but didn't.

If we fixed that issue in the most straightforward way, we would end up with strange behavior where the schema does load into ApolloServer, but various things we expect to happen after the schema loads don't happen:

- The "Gateway successfully loaded schema" message logged by ApolloGateway.load() never happens
- More importantly, serverWillStart plugins never get called!

One answer would be that we should just "hard fail" if the first updateCompositionConfig call failed. If ApolloServer's API were different such that there was an async function you were required to call and await at the top level before starting your server, that would make sense, and maybe we'll get there in AS3. Instead, AS2 just has a synchronous constructor, plus some mechanisms to make later operations wait on this.schemaDerivedData, plus the protected willStart function invoked by the http server integration libraries to trigger serverWillStart.

The way that willStart gets used is a little strange. What willStart does is:

1. If we're using a gateway, block until the initial call to ApolloGateway.load() completes. If that call rejected, return immediately (ignoring the error).
2. Call and await all serverWillStart plugins, throwing any errors from them.
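
A rough sketch of those two steps, to make the failure mode concrete. This is not the Apollo Server source: `Plugin`, `willStartSketch`, and the `gatewayLoad` parameter are names made up for illustration.

```typescript
// Hypothetical plugin shape: only the hook discussed above.
interface Plugin {
  serverWillStart?: () => Promise<void> | void;
}

// Sketch of the willStart behavior described above.
// `gatewayLoad` stands in for the promise of the initial ApolloGateway.load().
async function willStartSketch(
  plugins: Plugin[],
  gatewayLoad?: Promise<unknown>,
): Promise<void> {
  if (gatewayLoad) {
    try {
      await gatewayLoad; // step 1: block until the initial load settles
    } catch {
      return; // load rejected: return immediately, ignoring the error
    }
  }
  // Step 2: run every serverWillStart hook; a throwing hook rejects this promise.
  await Promise.all(plugins.map((p) => p.serverWillStart?.()));
}
```

Note the early `return`: in this model, once the initial load rejects, nothing ever runs the hooks later, even if a subsequent poll were to succeed.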

The http server integration libraries all call willStart when you apply middleware to their framework. What they actually do with the promise returned from willStart is inconsistent:

- Some of them (eg hapi) immediately await it. That means any error from a plugin will cause the applyMiddleware to throw.
- Some of them (eg express) do the equivalent of awaiting it at the beginning of every HTTP request. That means that any error from a plugin will cause any HTTP request (whether GraphQL or something else like CORS or fetching playground) to fail. Note though that comments make it clear that we do NOT intend errors from ApolloGateway.load() to cause these non-GraphQL requests to fail, though they should at least wait until that load call is done.
- Some of them (eg Azure functions) only await the promise while handling GraphQL requests. So other things like CORS or playground don't even wait for load() to finish and errors from plugins don't affect these other requests.
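
To illustrate the difference, here is a simplified model of the three styles. The `Req` and `Handler` types and the three `apply*` functions are hypothetical; the real integrations are considerably more involved.

```typescript
// Hypothetical request shape, just enough to tell GraphQL from other traffic.
type Req = { isGraphQL: boolean };
type Handler = (req: Req) => Promise<string>;

// Style 1 (hapi-like): await during setup, so a plugin error rejects setup itself.
async function applyEager(willStart: Promise<void>): Promise<Handler> {
  await willStart;
  return async () => "ok";
}

// Style 2 (express-like): every request waits on (and fails with) willStart.
function applyPerRequest(willStart: Promise<void>): Handler {
  return async () => {
    await willStart;
    return "ok";
  };
}

// Style 3 (Azure Functions-like): only GraphQL requests wait on willStart.
function applyGraphQLOnly(willStart: Promise<void>): Handler {
  return async (req) => {
    if (req.isGraphQL) await willStart;
    return "ok";
  };
}
```

With a rejected `willStart` promise, style 1 fails at setup, style 2 fails every request, and style 3 fails only GraphQL requests.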

My proposal is that we should run serverWillStart plugins once both of these things have happened:

- HTTP server integration library has called willStart
- Loading the schema has successfully occurred (possibly not the first poll)

If this happens during the willStart call (because the schema was already loaded when we called willStart, or because the initial load completes successfully during willStart), then it'll work just like now, and any errors from plugins will be thrown by the willStart promise.

If on the other hand, the initial load fails but a later polled load succeeds, we should call serverWillStart plugins at that point. If those plugins throw, we should make actual GraphQL operations throw via ApolloServer.graphQLServerOptions throwing, just like that function currently throws if there is no schema (where it awaits this.schemaDerivedData).
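
A minimal model of this proposal, assuming a single gate object that tracks the two conditions. `StartGate`, `runPlugins`, and `beforeOperation` are invented names; `beforeOperation` stands in for ApolloServer.graphQLServerOptions, and the sketch leaves out the case where the willStart promise itself reports plugin errors.

```typescript
// Hypothetical model of the proposal; none of these names exist in Apollo Server.
class StartGate {
  private willStartCalled = false;
  private schemaLoaded = false;
  private pluginRun?: Promise<void>;

  // `runPlugins` stands in for calling every serverWillStart hook.
  constructor(private readonly runPlugins: () => Promise<void>) {}

  // The HTTP integration has called willStart.
  onWillStart(): void {
    this.willStartCalled = true;
    this.maybeRunPlugins();
  }

  // A schema load succeeded (the initial load or a later poll).
  onSchemaLoaded(): void {
    this.schemaLoaded = true;
    this.maybeRunPlugins();
  }

  // Run the hooks exactly once, as soon as both conditions hold.
  private maybeRunPlugins(): void {
    if (this.willStartCalled && this.schemaLoaded && !this.pluginRun) {
      this.pluginRun = this.runPlugins();
      // Errors resurface in beforeOperation; this only avoids an unhandled rejection.
      this.pluginRun.catch(() => {});
    }
  }

  // Stand-in for graphQLServerOptions: operations fail until hooks have run cleanly.
  async beforeOperation(): Promise<void> {
    if (!this.pluginRun) throw new Error("schema has not loaded yet");
    await this.pluginRun;
  }
}
```

The order of the two events does not matter here, and a plugin failure makes every later operation reject, which is the behavior proposed above.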

Note that it's probably worth fixing apollographql/apollo-server#4428 in conjunction with this.

glasser · 2020-09-23T19:42:41Z

Also see apollographql/apollo-server#4588. Perhaps we should actually call serverWillStart once both of these things have happened:

- HTTP server integration library has called willStart OR something has called graphQLServerOptions (eg because they're calling the old express middleware directly instead of using AS.getMiddleware/applyMiddleware?)
- Loading the schema has successfully occurred (possibly not the first poll)

glasser · 2020-09-23T22:09:58Z

Eh never mind on the last comment — the old middleware is not actually part of the 2.x API.

glasser · 2021-01-27T23:55:55Z

(Another alternative is that we fix this in a major version bump to Apollo Server and we introduce an async ApolloServer.start() in that new version.)

glasser · 2021-02-03T00:08:12Z

We should also strongly consider making the built-in health check fail when the schema is not loaded.

glasser · 2021-02-04T18:26:56Z

I think the best fix will be to introduce async ApolloServer.start(), which will be optional in AS2 and can become required in AS3. We can continue to have the behavior where things start kinda randomly if you don't call it, but if you want control over when your server starts (including loading the schema) and the ability to handle the error, you can call this.

Full description to come, but tl;dr:

Previously server startup worked like this:

- AS constructor runs
  - If no gateway, calculate schema and schema derived data immediately
  - If gateway, kick off gateway.load from the end of the constructor, and if it async-throws, log an error once and make the server kinda broken forever
- At various spots in the framework integration code, call (but not await) protected willStart, which is an async function that first waits for the gateway to load the schema if necessary and then runs serverWillStart plugin functions; save the Promise returned by calling this.
- At request time in the framework integration code, await that promise. And also if there's no schema fail with an error.

Now server startup works like this:

- There's an explicit state machine situation inside AS
- AS constructor initializes state with schema directly if no gateway
- If there is a gateway the constructor DOES NOT KICK OFF gateway.load
- You can now call `await server.start()` yourself, which will first await gateway.load if necessary, and then await all serverWillStart calls
- If you're using `apollo-server` rather than an integration, `server.listen()` will just transparently do this for you; explicit `start()` is just for integrations
- The integration places that used to call willStart now call `server.ensureStarting()` instead which will kick off server.start in the background if you didn't (and log any errors thrown).
- The places that used to await promiseWillStart no longer do so; generally right after that code we end up calling graphqlServerOptions
- graphqlServerOptions now awaits `server.ensureStarted` which will start the server if necessary and throw if it threw.

Overall changes:

- If you're using `apollo-server`, startup errors will cause `listen` to reject, nothing else necessary.
- If you're using an integration you are encouraged to call `await server.start()` yourself after the constructor, which will let you detect startup errors.
- But if you don't do that, the server will still start up with similar properties. gateway.load won't start until the integration's `ensureStarting` or graphqlServerOptions' `ensureStarted` though. Errors will be logged.
- Also if you don't call `server.start()`, the full stack trace of any startup error will be logged on *every* failed graphql request instead of just a short message suggesting there's more in logs (but it's only at the beginning of your logs)

Yes this is the tl;dr version that I jotted off at the end of my work day off the top of my head without going back and skimming through the PR for details :)

Fixes #4921. Fixes apollographql/federation#335.

Previously, server startup worked like this:

- `new ApolloServer`
  - If no gateway, calculate schema and schema derived data immediately
  - If gateway, kick off gateway.load from the end of the constructor, and if it async-throws, log an error once and make the server kinda broken forever
- At various spots in the framework integration code, call (but don't await) the protected `willStart` function, which is an async function that first waits for the gateway to load the schema if necessary and then runs serverWillStart plugin functions; save the Promise returned by calling this.
- At request time in the framework integration code, await that Promise. And also, if there's no schema, fail with an error.

Now server startup works like this:

- ApolloServer represents its state explicitly with a new ServerState
- `new ApolloServer`
  - If no gateway, initialize all the schema-derived state directly like before (though the state now lives inside ServerState)
  - If gateway, the constructor DOES NOT KICK OFF `gateway.load()`
- You can now call `await server.start()` yourself, which will first await `gateway.load` if necessary, and then await all serverWillStart calls.
- If you're using `apollo-server` rather than an integration, `server.listen()` will just transparently do this for you; explicit `start()` is just for integrations!
- The integration places that used to call willStart now call `server.ensureStarting()` instead which will kick off server.start in the background if you didn't (and log any errors thrown).
- The places that used to await promiseWillStart no longer do so; generally right after that code we end up calling `graphqlServerOptions`
- `graphqlServerOptions` now awaits `server.ensureStarted` which will start the server if necessary and throw if it threw.

The overall change to user experience:

- If you're using `apollo-server`, startup errors will cause `listen` to reject; no code changes are necessary.
- If you're using an integration you are encouraged to call `await server.start()` yourself immediately after the constructor, which will let you detect startup errors.
- But if you don't do that, the server will call `start` itself eventually. When you try to execute your first GraphQL request, `start` will happen if it hasn't already. Also an integration call like `server.applyMiddleware` will initiate a background `start`. If startup fails, the startup error will be logged on *every* failed graphql request, not just the first time like happened before.
- If you have your own ApolloServer subclass that calls the protected `willStart` method, it won't work because that method is gone. Consider whether you can eliminate that call by just calling `start`, or perhaps call `ensureStarting` instead.

This is close enough to backwards-compatible to be appropriate for a v2 minor release. We are likely to make `start()` required in Apollo Server 3 (other than for `apollo-server`).

Also:

- Previously we used the deprecated `ApolloServer.schema` field to determine whether to install ApolloServerPluginInlineTrace, which we want to have active by default for federated schemas only. If you're using a gateway, this field isn't actually set at the time that ensurePluginInstantiation reads it. That's basically OK because we don't want to turn on the plugin automatically in the gateway, but in the interest of avoiding use of the deprecated field, I refactored it so that `ApolloServerPluginInlineTrace` is installed by default (ie, if you don't install your own version or install `ApolloServerPluginInlineTraceDisabled`) without checking the schema, and then (if it's installed automatically) it decides whether or not to be active by checking the schema at `serverWillStart` time.
- Similarly, schema reporting now throws in its `serverWillStart` if the schema is federated, instead of in `ensurePluginInstantiation`. (This does mean that if you're not using the new `start()` or `apollo-server`, that failure won't make your app fail as fast as if the `ApolloServer` constructor threw.)
- Fix some fastify tests that used a fixed listen port to not do that.
- I am doing my best to never accidentally run `prettier` on whole files and instead to very carefully select specific blocks of the file to format them several times per minute. Apparently I screwed up once and ran it once on `packages/apollo-server-core/src/ApolloServer.ts`. The ratio of "prettier changes" to "actual changes" in that file is low enough that I'd rather just leave the changes in this PR rather than spending time carefully reverting them. (It's one of the files I work on the most and being able to keep it prettier-clean will be helpful anyway.)
- Replace a hacky workaround for the lack of `start` in the op reg tests!
- Replace a use of a `Barrier` class I added recently in tests with the `@josephg/resolvable` npm package, which does basically the same thing. Use that package in new tests and in the core state machine itself.
- While running tests I found that some test files hung if run separately due to lack of cleanup. I ended up refactoring the cache tests to:
  - make who is responsible for calling cache.close more consistent
  - make the Redis client mocks self-contained mocks of the ioredis API instead of starting with an actual ioredis implementation and mocking out some internals
  - clean up Jest fake timers when a certain test is done

  I'm not super certain exactly which of these changes fixed the hangs but it does seem better this way. (Specifically I think the fake timer fix, which I did last, is what actually fixed it, but the other changes made it easier for me to reason about what was going on.) Can factor out into another PR if helpful.

Fixes #4921. Fixes apollographql/federation#335.

TODO:

- [ ] Go through all docs and READMEs that have 'FIXME start' and add calls to start. This involves verifying that you can actually do top-level await in the contexts that matter. (eg if it turns out that you really can't call await before you assign a handler in Lambda, that's interesting and may require some other changes to this PR!)
- [ ] Actually document start() in the apollo-server reference
- [ ] Document start() in all the integrations references
- [ ] CHANGELOG
- [ ] consider whether removing the protected willStart function is OK

Previously, server startup worked like this:

- `new ApolloServer`
  - If no gateway, calculate schema and schema derived data immediately
  - If gateway, kick off gateway.load from the end of the constructor, and if it async-throws, log an error once and make the server kinda broken forever
- At various spots in the framework integration code, call (but don't await) the protected `willStart` function, which is an async function that first waits for the gateway to load the schema if necessary and then runs serverWillStart plugin functions; save the Promise returned by calling this.
- At request time in the framework integration code, await that Promise. And also, if there's no schema, fail with an error.

Now server startup works like this:

- ApolloServer represents its state explicitly with a new ServerState
- `new ApolloServer`
  - If no gateway, initialize all the schema-derived state directly like before (though the state now lives inside ServerState)
  - If gateway, the constructor DOES NOT KICK OFF `gateway.load()`
- You can now call `await server.start()` yourself, which will first await `gateway.load` if necessary, and then await all serverWillStart calls.
- If you're using `apollo-server` rather than an integration, `server.listen()` will just transparently do this for you; explicit `start()` is just for integrations!
- Serverless frameworks also call it automatically for you in the background (kicked off by the constructor) because their startup has to be synchronous; if it fails then future requests will all fail (and log) as before.
- The integration places that used to call willStart now call `server.ensureStarting()` instead which will kick off server.start in the background if you didn't (and log any errors thrown).
- The places that used to await promiseWillStart no longer do so; generally right after that code we end up calling `graphqlServerOptions`
- `graphqlServerOptions` now awaits `server.ensureStarted` which will start the server if necessary and throw if it threw.

The overall change to user experience:

- If you're using `apollo-server`, startup errors will cause `listen` to reject; no code changes are necessary.
- If you're using a serverless integration, the behavior will be relatively similar, except that the startup error will be logged on all requests instead of just the first one.
- If you're using an integration you are encouraged to call `await server.start()` yourself immediately after the constructor, which will let you detect startup errors.
- But if you don't do that, the server will call `start` itself eventually. When you try to execute your first GraphQL request, `start` will happen if it hasn't already. Also an integration call like `server.applyMiddleware` will initiate a background `start`. If startup fails, the startup error will be logged on *every* failed graphql request, not just the first time like happened before.
- If you have your own ApolloServer subclass that calls the protected `willStart` method, it will still work (the method isn't deleted) but you should rewrite it to either `await this.start()` or `this.ensureStarting()` instead.

This is close enough to backwards-compatible to be appropriate for a v2 minor release. We are likely to make `start()` required in Apollo Server 3 for non-serverless integrations.

Also:

- Previously we used the deprecated `ApolloServer.schema` field to determine whether to install ApolloServerPluginInlineTrace, which we want to have active by default for federated schemas only. If you're using a gateway, this field isn't actually set at the time that ensurePluginInstantiation reads it. That's basically OK because we don't want to turn on the plugin automatically in the gateway, but in the interest of avoiding use of the deprecated field, I refactored it so that `ApolloServerPluginInlineTrace` is installed by default (ie, if you don't install your own version or install `ApolloServerPluginInlineTraceDisabled`) without checking the schema, and then (if it's installed automatically) it decides whether or not to be active by checking the schema at `serverWillStart` time.
- Similarly, schema reporting now throws in its `serverWillStart` if the schema is federated, instead of in `ensurePluginInstantiation`. (This does mean that if you're not using the new `start()` or `apollo-server`, that failure won't make your app fail as fast as if the `ApolloServer` constructor threw.)
- Fix some fastify tests that used a fixed listen port to not do that.
- I am doing my best to never accidentally run `prettier` on whole files and instead to very carefully select specific blocks of the file to format them several times per minute. Apparently I screwed up once and ran it once on `packages/apollo-server-core/src/ApolloServer.ts`. The ratio of "prettier changes" to "actual changes" in that file is low enough that I'd rather just leave the changes in this PR rather than spending time carefully reverting them. (It's one of the files I work on the most and being able to keep it prettier-clean will be helpful anyway.)
- Replace a hacky workaround for the lack of `start` in the op reg tests!
- Replace a use of a `Barrier` class I added recently in tests with the `@josephg/resolvable` npm package, which does basically the same thing. Use that package in new tests and in the core state machine itself.
- While running tests I found that some test files hung if run separately due to lack of cleanup. I ended up refactoring the cache tests to:
  - make who is responsible for calling cache.close more consistent
  - make the Redis client mocks self-contained mocks of the ioredis API instead of starting with an actual ioredis implementation and mocking out some internals
  - clean up Jest fake timers when a certain test is done

  I'm not super certain exactly which of these changes fixed the hangs but it does seem better this way. (Specifically I think the fake timer fix, which I did last, is what actually fixed it, but the other changes made it easier for me to reason about what was going on.) Can factor out into another PR if helpful.

Fixes #4921. Fixes apollographql/federation#335.

Co-authored-by: Stephen Barlow <stephen@apollographql.com>
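
The startup flow in the commit message above can be modeled roughly as follows. This is a toy sketch, not the code from the PR: `MiniServer`, `Phase`, the `load` callback (standing in for `gateway.load()` plus the serverWillStart hooks), and the `log` callback are assumptions, and the real ServerState may have different states.

```typescript
// Hypothetical phases; the real ServerState may differ.
type Phase = "initialized" | "starting" | "started" | "failed";

class MiniServer {
  phase: Phase = "initialized";
  private startup?: Promise<void>;

  // The constructor only stores `load`; it does NOT kick it off.
  constructor(
    private readonly load: () => Promise<void>,
    private readonly log: (message: string) => void = () => {},
  ) {}

  // Explicit startup: rejects if loading fails, so the caller can handle it.
  start(): Promise<void> {
    if (!this.startup) {
      this.phase = "starting";
      this.startup = this.load().then(
        () => { this.phase = "started"; },
        (err: unknown) => { this.phase = "failed"; throw err; },
      );
    }
    return this.startup;
  }

  // Integration setup path: start in the background if nobody has, and log errors.
  ensureStarting(): void {
    if (this.startup) return;
    this.start().catch((err: unknown) => this.log(`startup failed: ${String(err)}`));
  }

  // Request path: start if necessary, and throw if startup threw.
  async ensureStarted(): Promise<void> {
    await this.start();
  }
}
```

Because `start()` memoizes its promise, calling it explicitly, via `ensureStarting()`, or via `ensureStarted()` all share one startup attempt, and a failed startup keeps rejecting on every later `ensureStarted()` call.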

glasser · 2021-03-22T23:15:07Z

I have an alpha out (of Apollo Server, not Gateway) that resolves this issue.

First, install the alpha in your app by installing v2.22.0-alpha.0 of whatever Apollo Server packages your app directly depends on. This might look something like

npm i apollo-server-express@2.22.0-alpha.0 apollo-server-core@2.22.0-alpha.0

If you're using the apollo-server package (and calling listen()), that should be enough.

Otherwise, if you're using (say) apollo-server-express, you'll want to insert a call to await server.start() between server = new ApolloServer and server.applyMiddleware. This does mean you'll want to be setting up your server in a context where you can await (an async function, the top level in a .mjs file, etc). This call is technically optional but you don't get the improved error handling if you don't do it!
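
A sketch of that ordering, written against a minimal structural type rather than the real packages so it stands alone. `StartableServer` and `setUpServer` are hypothetical names; in a real app `server` would be an `ApolloServer` from apollo-server-express and `app` an express application.

```typescript
// Minimal structural type for the two calls discussed above (an assumption for
// illustration; a real app would import ApolloServer from apollo-server-express).
interface StartableServer<App> {
  start(): Promise<void>;
  applyMiddleware(options: { app: App }): void;
}

// Await startup before wiring the middleware so startup errors surface here.
async function setUpServer<App>(
  server: StartableServer<App>,
  app: App,
): Promise<boolean> {
  try {
    await server.start(); // schema load and serverWillStart hooks
  } catch (err) {
    console.error("Apollo Server failed to start:", err);
    return false; // caller decides whether to retry or exit
  }
  server.applyMiddleware({ app });
  return true;
}
```

Returning a boolean is just one way to hand the decision back to the caller; rethrowing or exiting the process would work equally well.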

If you're using a serverless package like apollo-server-lambda, this doesn't fully improve the situation, though I'd still be interested in hearing about how it works for you.

Please provide feedback at #5051.

More details at https://www.apollographql.com/docs/apollo-server/api/apollo-server/#start (though oops, the docs shouldn't be up until the API is in a non-alpha version, so if this doesn't get released pretty soon I might roll that back).

glasser · 2021-03-26T00:46:42Z

This is released as Apollo Server v2.22.0!

abernix transferred this issue from apollographql/apollo-server Jan 15, 2021

glasser mentioned this issue Feb 4, 2021

experimental_pollInterval doesn't continue to poll if a federated service is not reachable at startup #235

Closed

glasser self-assigned this Feb 11, 2021

glasser mentioned this issue Feb 11, 2021

No way to await errors on server startup apollographql/apollo-server#4921

Closed

glasser mentioned this issue Mar 3, 2021

Add async server.start() function apollographql/apollo-server#4981

Merged

5 tasks

glasser closed this as completed in apollographql/apollo-server#4981 Mar 22, 2021
