Race condition between Server session release and Net read #7285
Conversation
Force-pushed from 2d74646 to 3d632d4
What are the cases where the server_netvc is doing a read from a different thread? The server_netvc, the client_netvc, and the HttpSM are all associated with the same ET_NET thread for the duration of the HttpSM, so all the net I/O should be performed on that one ET_NET thread. In that case the buffer frees in Http1ServerSession::do_io_close are serialized with the READ/WRITE event processing. In my reading of the current master, we may need to add do_io_read(nullptr, 0, nullptr) and do_io_write(nullptr, 0, nullptr) in Http1ServerSession::do_io_close to ensure that a later net I/O event does not get sent back to the now-deleted continuation. But if the server_vc is still set, the closed flag should already be set, which should block any future net I/O on that netvc.
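A minimal sketch of that suggestion, assuming it would go in Http1ServerSession::do_io_close (this is illustrative, not the actual master code; the surrounding cleanup is elided):

```cpp
void
Http1ServerSession::do_io_close(int alerrno)
{
  if (server_vc) {
    // Detach any pending read/write VIOs first, so a racing net event
    // cannot be dispatched back into this soon-to-be-deleted continuation.
    server_vc->do_io_read(nullptr, 0, nullptr);
    server_vc->do_io_write(nullptr, 0, nullptr);
    server_vc->do_io_close(alerrno);
    server_vc = nullptr;
  }
  // ... existing cleanup: release the read MIOBuffer, destroy the session ...
}
```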
Force-pushed from b1bd6e8 to ff4df84
Defer Http1ServerSession destruction to prevent read-buffer corruption due to a race with network I/O
This seems to happen when we use a plugin that creates a response Transform. We have a couple of custom plugins that perform server-side rendering (somewhat like the combo_handler plugin) by parsing an HTML origin response, firing off sideways calls to fetch the assets listed in that response, and eventually stitching the fetched responses together following a template (HTML in some cases, Dust in others). The sideways requests are performed using the TS Fetch API. I tried to set the thread affinity on the transform continuation and the sideways Fetch calls, yet couldn't force all the associated requests onto the same thread.
The problem is that do_io_close() is called from SM/Tunnel continuations, which presumably guard access with the SM/Tunnel mutex. Somehow that doesn't prevent the race, which makes me think that when Transforms are in play, the network read VIO isn't really guarded by the SM/Tunnel mutex anymore. Either way, once do_io_close() resets the continuation (or even after a do_io_read(nullptr, 0, nullptr), for that matter), the network I/O is unblocked and may try to use the now-freed MIOBuffer for the read. What we really need is a dedicated mutex for the net read, with the SM/Tunnel locking that same mutex. But I did not want to hurt performance in the normal case, or introduce complexity by having to handle lock failures in the SM/Tunnel continuations. So, as a compromise, I deferred the buffer release until the server session is detached from the SM and placed on the idle pool, at which point adding an additional mutex should not impact request processing or latency. We can discuss more on Slack/Zoom or during the summit.
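For reference, a minimal sketch of what the thread-affinity attempt mentioned above might look like, assuming the TSContThreadAffinitySet/TSEventThreadSelf APIs from recent ATS releases; the helper name is hypothetical:

```cpp
#include <ts/ts.h>

// Hypothetical helper: pin a continuation (e.g. the transform continuation
// or the one driving the sideways Fetch calls) to the event thread we are
// currently running on, so its callbacks stay on this same ET_NET thread.
static void
pin_to_current_thread(TSCont contp)
{
  TSContThreadAffinitySet(contp, TSEventThreadSelf());
}
```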
I think it would be good to write an autest to exercise this scenario. I was able to add an autest in PR #7295 that exercised the threading problem that @calavera was seeing in the SSN_START hook. I'll try to add one for this case. Is your plugin like the multiplexer, except some part of the logic is executed from a Task thread?
That'd be awesome! Thanks @shinrich. The plugin is more like esi/combo-handler. The user requests a base page URL; the plugin creates a response transform on the HTML response, parses it, extracts the embedded links for static assets and the like, fetches those assets, and replaces the links with the fetched responses. The final response going to the user is the fully stitched response with pretty much all the links resolved.
This pull request has been automatically marked as stale because it has not had recent activity. Marking it stale to flag it for further consideration by the community.
@sudheerv If this is still an issue please reopen. |
This fixes #7284
Defer Http1ServerSession destruction to prevent read-buffer corruption due to a race with network I/O
PR #7278 fixed the crash due to a race condition during server session acquisition, but it did not actually solve the problem for the same race during server session closing.
It turns out the root cause is that the read buffer attached to the server NetVC is actually "owned" by the Http1ServerSession object, which destroys itself, along with its associated mutex, when do_io_close() is called. The actual NetVC may be doing a read in a different thread at the same time, using the read buffer that Http1ServerSession has just freed. This seems to happen more often with Transform plugins at play. I tried to switch ownership of the read buffer to the server NetVC, but could not get it working correctly, as it seemed to expose even more race conditions and asserts all over the place.
Instead, the solution I implemented is to not destroy the server session during do_io_close(), but to add it to a separate unshared/gc session pool, which eventually cleans up these sessions on a subsequent event from either the inactivity cop or a stray read that was racing with the close. Moving them to a separate gc session pool also makes it possible to guard the event handling against the read with a dedicated mutex. That mutex guard cannot easily be added within an SM callback, since that would mean handling lock failures and so on, which introduces complexity into the state machine.
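To make the shape of the fix concrete, here is a minimal, self-contained sketch of the deferred-destruction idea; all names are hypothetical and the real session/buffer types are reduced to placeholders:

```cpp
#include <mutex>
#include <vector>

// Stand-in for Http1ServerSession and the read buffer it owns.
struct ServerSession {
  char *read_buffer = nullptr;
};

class GcSessionPool {
public:
  // Called from do_io_close(): park the session instead of destroying it.
  void defer(ServerSession *ssn) {
    std::lock_guard<std::mutex> guard(gc_mutex);
    zombies.push_back(ssn);
  }

  // Called later from a periodic event (the inactivity cop) or from the
  // stray read that was racing with the close. Both paths serialize on
  // gc_mutex, so the buffer cannot be freed while a reader still holds it.
  void reap() {
    std::lock_guard<std::mutex> guard(gc_mutex);
    for (ServerSession *ssn : zombies) {
      delete[] ssn->read_buffer; // the read buffer is released only now
      delete ssn;
    }
    zombies.clear();
  }

private:
  std::mutex gc_mutex;
  std::vector<ServerSession *> zombies;
};
```

The key design point is that the dedicated gc_mutex lives outside the SM/Tunnel continuations, so no state-machine callback ever has to handle a failed lock attempt.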