Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Destroy streams created in rejected Promise.all() #609

Closed
wants to merge 2 commits into from

Conversation

matthew-white
Copy link
Member

@matthew-white matthew-white commented Sep 16, 2022

Closes #482. See the commit message of the second commit for additional notes.

What has been done to verify that this works as intended?

I deployed this PR to the QA server and followed the reproduction steps in #482. As today and as expected, 9/10 concurrent requests were successful. However, unlike today, there was no database connection leak after the 10 responses.

We're not sure yet that we're going to ship this PR (see the next section). If we decide to, then I'll write unit tests of allWithStreams().

Why is this the best possible solution? Were any other approaches considered?

We want to add stream timeout functionality along the lines of #599/#604. However, we also want some way to destroy streams as soon as we know they can be. #604 (comment) mentions two options along those lines. This PR represents option 1, while #608 represents option 2.

I first tried running with option 2. However, when @alxndrsn ran the benchmarker against #608, the results were lackluster. I've pushed this PR in part so that @alxndrsn can run the benchmarker against it. If the PR is more performant than #608 (as I think we expect), then we'll probably want to ship it (as part of either v1.5.3 or v2022.3).

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

The hope is that this PR fixes connection leak issues. However, there's a risk of regression with each of the three streaming endpoints that the PR modifies. I think that if the results from the benchmarker are good, that should help us gain confidence in the PR.

Does this change require updates to the API documentation? If so, please update docs/api.md as part of this PR.

No changes needed.

Before submitting this PR, please make sure you have:

  • run make test-full and confirmed all checks still pass OR confirm CircleCI build passes
  • verified that any code from external sources are properly credited in comments

Closes #482. This commit prevents the following sequence:

- A stream is created within a Promise.all().
- The stream is to be processed after the Promise.all() is fulfilled.
- However, another promise in the Promise.all() is rejected.
- As a result, the Promise.all() as a whole is rejected.
- As a result, the stream is never processed and isn't destroyed.
// that resolves to a stream, then the fate of the stream will be tied to the
// fate of each promise within `values`. If any promise is rejected, then the
// stream will be immediately destroyed.
const allWithStreams = (values) => {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love the name allWithStreams(), but I couldn't think of anything better. 🤔

};

for (const value of values) {
Promise.resolve(value)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually call resolve() from lib/util/promise.js rather than call Promise.resolve() directly. However, resolve() isn't otherwise used in this file, and importing it resulted in ESLint errors: there are several variables in this file that are also named resolve. It feels preferable to use Promise.resolve() here rather than rename those variables or alias resolve().

Comment on lines +130 to +133
if (result instanceof PartialPipe) {
for (const stream of result.streams) handleStream(stream);
} else if (typeof result === 'object' && result != null &&
typeof result.pipe === 'function') {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to be able to handle both a regular stream and a PartialPipe. It looks like Submissions.streamForExport() returns a promise of a PartialPipe, while SubmissionAttachments.streamForExport() returns a promise of a stream.

@matthew-white
Copy link
Member Author

In #604, I ran the benchmarker against this PR (among other commits). This was surprising to me, but the results were decidedly lackluster.

@alxndrsn actually noticed that Backend crashed when he ran the benchmarker against this PR locally. @alxndrsn wrote on Slack:

i'm guessing it's cleaning up slightly too enthusiastically. that, or something isn't getting told that the stream is shut down... so that might indicate a bug in slonik / my slonik patch

In 3/4 of the CircleCI builds that #604 (comment) links to, the following error message can be seen:

request to http://localhost:8383/v1/projects/1/forms/250_questions/submissions.csv.zip?splitSelectMultiples=true&groupPaths=true&deletedFields=true failed, reason: connect ECONNREFUSED 127.0.0.1:8383

I think that's a sign that Backend is crashing. In the one build without that error, the benchmarker results were quite good. Hopefully there's a way for us to prevent Backend from crashing! Among all the commits tested, only this PR resulted in that error message, so I do think there's something in this PR specifically that's going wrong.

@alxndrsn
Copy link
Contributor

alxndrsn commented Sep 20, 2022

The errors may also be bugs in pg, and may not be specific to this branch:

Observed on this branch:

ConnectionError: timeout exceeded when trying to connect
    at Object.createConnection (/odk/backend/node_modules/slonik/dist/src/factories/createConnection.js:54:23)
    at runNextTicks (internal/process/task_queues.js:60:5)
    at listOnTimeout (internal/timers.js:526:9)
    at processTimers (internal/timers.js:500:7)ConnectionError: timeout exceeded when trying to connect
    at Object.createConnection (/odk/backend/node_modules/slonik/dist/src/factories/createConnection.js:54:23)
    at runNextTicks (internal/process/task_queues.js:60:5)
    at listOnTimeout (internal/timers.js:526:9)
    at processTimers (internal/timers.js:500:7)ConnectionError: timeout exceeded when trying to connect
    at Object.createConnection (/odk/backend/node_modules/slonik/dist/src/factories/createConnection.js:54:23)
    at runNextTicks (internal/process/task_queues.js:60:5)
    at listOnTimeout (internal/timers.js:526:9)
    at processTimers (internal/timers.js:500:7)ConnectionError: timeout exceeded when trying to connect
    at Object.createConnection (/odk/backend/node_modules/slonik/dist/src/factories/createConnection.js:54:23)
    at runNextTicks (internal/process/task_queues.js:60:5)
    at listOnTimeout (internal/timers.js:526:9)
    at processTimers (internal/timers.js:500:7)ConnectionError: timeout exceeded when trying to connect
    at Object.createConnection (/odk/backend/node_modules/slonik/dist/src/factories/createConnection.js:54:23)
    at runNextTicks (internal/process/task_queues.js:60:5)
    at listOnTimeout (internal/timers.js:526:9)
    at processTimers (internal/timers.js:500:7)/odk/backend/node_modules/pg/lib/client.js:367
    if (this.activeQuery.name) {
                         ^

TypeError: Cannot read property 'name' of null
    at Client._handleParseComplete (/odk/backend/node_modules/pg/lib/client.js:367:26)
    at Connection.emit (events.js:400:28)
    at Connection.emit (domain.js:470:12)
    at /odk/backend/node_modules/pg/lib/connection.js:115:12
    at Parser.parse (/odk/backend/node_modules/pg-protocol/dist/parser.js:40:17)
    at Socket.<anonymous> (/odk/backend/node_modules/pg-protocol/dist/index.js:10:42)
    at Socket.emit (events.js:400:28)
    at Socket.emit (domain.js:470:12)
    at addChunk (internal/streams/readable.js:290:12)
    at readableAddChunk (internal/streams/readable.js:265:9)
    at Socket.Readable.push (internal/streams/readable.js:204:10)
    at TCP.onStreamRead (internal/stream_base_commons.js:188:23)
make: *** [Makefile:16: run] Error 1

This may be: brianc/node-postgres#2705

I have also seen the following when testing with another branch, while machine was under heavy load:

/odk/backend/node_modules/pg/lib/client.js:360
    this.activeQuery.handleCommandComplete(msg, this.connection)
                     ^

TypeError: Cannot read property 'handleCommandComplete' of null
    at Client._handleCommandComplete (/odk/backend/node_modules/pg/lib/client.js:360:22)
    at Connection.emit (events.js:400:28)
    at Connection.emit (domain.js:470:12)
    at /odk/backend/node_modules/pg/lib/connection.js:115:12
    at Parser.parse (/odk/backend/node_modules/pg-protocol/dist/parser.js:40:17)
    at Socket.<anonymous> (/odk/backend/node_modules/pg-protocol/dist/index.js:10:42)
    at Socket.emit (events.js:400:28)
    at Socket.emit (domain.js:470:12)
    at addChunk (internal/streams/readable.js:290:12)
    at readableAddChunk (internal/streams/readable.js:265:9)
    at Socket.Readable.push (internal/streams/readable.js:204:10)
    at TCP.onStreamRead (internal/stream_base_commons.js:188:23)
make: *** [Makefile:16: run] Error 1

This may be: brianc/node-postgres#1872

And also:

/odk/backend/node_modules/pg/lib/client.js:367
    if (this.activeQuery.name) {
                         ^

TypeError: Cannot read property 'name' of null
    at Client._handleParseComplete (/odk/backend/node_modules/pg/lib/client.js:367:26)
    at Connection.emit (events.js:400:28)
    at Connection.emit (domain.js:470:12)
    at /odk/backend/node_modules/pg/lib/connection.js:115:12
    at Parser.parse (/odk/backend/node_modules/pg-protocol/dist/parser.js:40:17)
    at Socket.<anonymous> (/odk/backend/node_modules/pg-protocol/dist/index.js:10:42)
    at Socket.emit (events.js:400:28)
    at Socket.emit (domain.js:470:12)
    at addChunk (internal/streams/readable.js:290:12)
    at readableAddChunk (internal/streams/readable.js:265:9)
    at Socket.Readable.push (internal/streams/readable.js:204:10)
    at TCP.onStreamRead (internal/stream_base_commons.js:188:23)

It may be a race condition where the activeQuery in pg/lib/client.js has been cleared while the connection to postgres is still returning messages, either live or buffered...

Base automatically changed from release to master September 22, 2022 05:26
@matthew-white
Copy link
Member Author

matthew-white commented Sep 23, 2022

I think the next step is to rebase this PR on the master branch. I think I'll wait to do that until we patch Slonik on the master branch by merging #599 or a similar PR. Once I rebase the PR, it may be useful to:

@alxndrsn
Copy link
Contributor

  • Check pg_stat_activity for a database connection leak at the end of the benchmarker.

PR for a script to do this at: #622

@matthew-white
Copy link
Member Author

I've marked this PR as a draft while we finish working on #564.

@matthew-white
Copy link
Member Author

Closing now that we've decided to close #634.

@matthew-white matthew-white deleted the all-with-streams branch March 22, 2023 02:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants