
API Core: Fix race in 'BackgroundConsumer._thread_main'. #8883

Merged (3 commits, Aug 1, 2019)
Conversation

@tseaver tseaver commented Aug 1, 2019

See #7817.

In addition to passing all api_core tests locally, I have verified that the firestore and pubsub system tests all run cleanly with this patch (an earlier version caused them to hang).

@tseaver tseaver added api: pubsub Issues related to the Pub/Sub API. api: core api: firestore Issues related to the Firestore API. labels Aug 1, 2019
@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Aug 1, 2019
- Bogus changes to pubsub and firestore to ensure their systests pass on CI.
@tseaver tseaver added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Aug 1, 2019
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Aug 1, 2019
@plamut plamut left a comment


Looks good; it solves one part of the issue.

There is still a possibility that the Consumer gets paused right after we release the lock, and just before we fetch a new response from the server. However, I don't think there is a straightforward way of addressing that.

Since self._bidi_rpc.recv() might block, we must release the _wake lock; otherwise, shutting down the background consumer could get blocked by it (details). This was nicely caught by the system tests with an earlier version of the fix.

All in all, this PR is an improvement, thus approving.
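The shape of the fix can be sketched as follows. This is a simplified toy, not the actual api_core.bidi code; the class and attribute names merely mirror it. The key points are that the paused check and the wait happen while holding the _wake condition, and that the lock is released before the blocking recv() call:

```python
import threading


class SketchConsumer:
    """Toy sketch of BackgroundConsumer's pause handling (hypothetical
    names mirroring api_core.bidi; not the real implementation)."""

    def __init__(self, recv, on_response):
        self._recv = recv
        self._on_response = on_response
        self._paused = False
        self._stopped = False
        self._wake = threading.Condition()

    def pause(self):
        with self._wake:
            self._paused = True

    def resume(self):
        with self._wake:
            self._paused = False
            self._wake.notify_all()

    def stop(self):
        with self._wake:
            self._stopped = True
            self._wake.notify_all()

    def run_once(self):
        # Inspect the paused flag only while holding the condition's lock,
        # and sleep on the condition instead of busy-waiting.
        with self._wake:
            while self._paused and not self._stopped:
                self._wake.wait()
            if self._stopped:
                return False
        # The lock is deliberately released here: recv() may block, and
        # holding self._wake across it would also block stop().  This gap
        # is exactly the residual race window mentioned above.
        response = self._recv()
        self._on_response(response)
        return True
```

A pause() that lands between the release of _wake and the recv() call still slips through, which is the remaining part of the issue.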

@tseaver tseaver merged commit 5c7246f into googleapis:master Aug 1, 2019
@tseaver tseaver deleted the 7817-api_core-fix-bidi-race branch August 1, 2019 21:05
tseaver commented Aug 1, 2019

@plamut I updated the PR description / commit message to avoid closing #7817. Can you please clarify how we might reproduce / fix the remaining part?


plamut commented Aug 2, 2019

Reproducing the issue

I did not manage to reproduce it with a real application (the time window is narrow), but I came up with a test that detects whether recv() is called while the BackgroundConsumer is in the paused state (which should not happen): https://gist.github.com/plamut/8f996fdc9113e8de2b2c050befb36ff6

The idea is to constantly pause/resume the consumer and check if the latter is indeed paused while recv() is being executed.

The test consistently fails on my machine, but also passes if one comments out the following:

```python
pause_resume_thread.start()
```

Can anyone confirm my findings and proof-read the test?
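The detection idea can be condensed into a deterministic toy (hypothetical names, not the gist's actual code): wrap recv() so it records whether the consumer is paused at the instant it runs, and then force the losing interleaving where pause() lands between the paused check and the recv() call. The gist achieves the same effect statistically, with a thread flipping pause/resume in a tight loop.

```python
class PausableSource:
    """Toy stand-in for the consumer's state, with an instrumented recv()."""

    def __init__(self):
        self.paused = False
        self.recv_while_paused = 0  # violations: recv() ran while paused

    def recv(self):
        if self.paused:
            self.recv_while_paused += 1
        return "response"


def consumer_step(source, between_check_and_recv=None):
    """One iteration of the racy loop: check 'paused', then call recv()
    outside any lock.  The callback stands in for another thread getting
    scheduled between the two steps."""
    if source.paused:
        return None
    if between_check_and_recv is not None:
        between_check_and_recv()
    return source.recv()
```

Injecting a pause() via the callback makes the violation counter go up, which is exactly what the gist's test looks for.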

Fixing the issue

Apparently, we must be holding the self._wake lock when receiving messages, otherwise another thread can pause the consumer in the meantime, and the messages will be received and delivered in a paused state.

However, simply indenting the following block into the with self._wake: context would introduce other, more serious problems:

```python
_LOGGER.debug("waiting for recv.")
response = self._bidi_rpc.recv()
_LOGGER.debug("recved response.")
self._on_response(response)
```

Since self._bidi_rpc.recv() might block, holding the self._wake lock across it would also block stopping the consumer, because that operation tries to obtain the very same self._wake lock (details).

Is there a good way to make the recv() method non-blocking? Probably not?
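The deadlock concern can be demonstrated with generic locks (no api_core code involved): a thread that holds a lock across a blocking call stalls any stop() that needs the same lock. Here time.sleep() stands in for a blocking recv():

```python
import threading
import time

lock = threading.Lock()          # plays the role of self._wake's lock
timeline = []
recv_started = threading.Event()


def consumer_holding_lock():
    with lock:                   # WRONG pattern: lock held across recv()
        recv_started.set()
        time.sleep(0.3)          # stands in for a blocking recv()
        timeline.append("recv done")


def stop():
    with lock:                   # stop() needs the very same lock
        timeline.append("stopped")


t = threading.Thread(target=consumer_holding_lock)
t.start()
recv_started.wait()
start = time.monotonic()
stop()                           # cannot proceed until "recv" returns
elapsed = time.monotonic() - start
t.join()
# stop() was forced to wait out the entire simulated recv().
```

With a real server stream, recv() can block indefinitely, so the stall is unbounded rather than 0.3 seconds.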

Estimating the issue's impact

I'm not familiar with Firestore, so I can only speak for Pub/Sub.

The background consumer gets paused when the streaming pull manager determines that the client currently has more than MAX_LOAD messages on its hands, and resumes the consumer when there is enough capacity.

If the consumer main loop detects the paused state too late, one extra batch of server messages will be received before pausing in the next iteration.

For the PubSub client, this would mean receiving one extra batch of messages that would sit in the holding buffer until they can be processed. The extra memory usage is thus bounded by the maximum size of a single response from the server - probably not a showstopper.
