Contributor

@traeak traeak commented Dec 5, 2023

This fixes a race condition (found and debugged in ats92) that occurs when a session is taken from the session pool and migrated to another thread. It applies to GLOBAL session pools. There are still rare cases where events leak through to the VC on the pre-migration thread, triggering various asserts.

This PR moves thread migration out of the session manager's acquire critical section. During migration it locks the netvc on the original thread and stops events there. Global (and likely hybrid) pool performance is improved in this case.
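
The ordering described above can be sketched in miniature. This is not the Traffic Server code: `Session`, `GlobalPool`, `take`, and `acquire` are hypothetical stand-ins, and plain `std::mutex` replaces the ATS mutex and event machinery. It only illustrates the idea that the pool lock covers removal from the pool, while the ownership change happens afterwards under the session's own lock.

```cpp
#include <cassert>
#include <deque>
#include <memory>
#include <mutex>

// Hypothetical stand-in for a pooled server session; not an ATS type.
struct Session {
  std::mutex mutex;            // guards owner_thread and events_enabled
  int owner_thread = 0;        // id of the thread whose event loop owns the session
  bool events_enabled = true;  // whether the owning thread may deliver events
};

// Hypothetical stand-in for a global session pool.
class GlobalPool {
public:
  void release(std::shared_ptr<Session> s) {
    assert(s != nullptr);
    std::lock_guard<std::mutex> lock(pool_mutex_);
    sessions_.push_back(std::move(s));
  }

  // Short critical section: only remove the session from the pool.
  std::shared_ptr<Session> take() {
    std::lock_guard<std::mutex> lock(pool_mutex_);
    if (sessions_.empty()) {
      return nullptr;
    }
    std::shared_ptr<Session> s = std::move(sessions_.front());
    sessions_.pop_front();
    return s;
  }

private:
  std::mutex pool_mutex_;
  std::deque<std::shared_ptr<Session>> sessions_;
};

// Migration runs after take() has returned, so the pool lock is no longer held.
inline std::shared_ptr<Session> acquire(GlobalPool &pool, int this_thread) {
  std::shared_ptr<Session> s = pool.take();
  if (!s) {
    return nullptr;
  }
  std::lock_guard<std::mutex> lock(s->mutex);  // lock the session itself
  if (s->owner_thread != this_thread) {
    s->events_enabled = false;      // stop events on the original thread
    s->owner_thread = this_thread;  // hand the session to the acquiring thread
    s->events_enabled = true;       // events now belong to the new owner
  }
  return s;
}
```

Other threads can keep using the pool while one session is being migrated, which is one plausible reading of why the description mentions a performance improvement for global pools.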

Initial testing with the test case verifies that ats master crashes (quickly and often) without the patch and runs fine with the patch.

Will test in prod for stability.

should close #9690

Also related to: #9689

@traeak traeak self-assigned this Dec 5, 2023
@traeak traeak modified the milestones: 10.0.0, 9.2.4 Dec 5, 2023
// If thread must be migrated, clear out the VC's
// data and event handling on the original thread.
if (ethread != server_vc->get_thread()) {
SCOPED_MUTEX_LOCK(vclock, server_vc->mutex, ethread);
Member

Do you want to do a try-lock instead? If the other thread handles a close before you get the lock here, things could disappear from underneath?

Contributor Author

Just changing that to a try-lock ends up with leaked-connection messages in diags.log.
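
The trade-off in this exchange can be illustrated with standard C++ locks. This is a sketch only: `Conn`, `migrate_blocking`, and `migrate_try` are hypothetical names, `std::mutex` stands in for the ATS mutex, and the explanation of the leak (a failed try-lock leaving an already-acquired connection with no handler) is an assumption, not something the thread confirms.

```cpp
#include <mutex>

// Hypothetical stand-in for a connection being migrated; not an ATS type.
struct Conn {
  std::mutex mutex;      // guards owner_thread
  int owner_thread = 0;  // id of the thread that currently owns the connection
};

// Blocking variant: waits until the current owner releases the lock,
// so the migration always happens.
inline void migrate_blocking(Conn &c, int new_owner) {
  std::lock_guard<std::mutex> lock(c.mutex);
  c.owner_thread = new_owner;
}

// Try-lock variant: gives up immediately if the lock is not available.
// Returns false when nothing was migrated; the caller must then handle
// the connection some other way, or it is left without an owner.
inline bool migrate_try(Conn &c, int new_owner) {
  std::unique_lock<std::mutex> lock(c.mutex, std::try_to_lock);
  if (!lock.owns_lock()) {
    return false;
  }
  c.owner_thread = new_owner;
  return true;
}
```

A caller that switches from the blocking form to the try form also needs a path for the `false` case, such as retrying later or closing the connection. Swapping the lock type alone does not add that path.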

@traeak traeak marked this pull request as ready for review January 8, 2024 14:27
@traeak traeak marked this pull request as draft January 10, 2024 18:06
@traeak traeak marked this pull request as ready for review January 11, 2024 17:49
(!(match_style & TS_SERVER_SESSION_SHARING_MATCH_MASK_HOSTSNISYNC) || validate_host_sni(sm, first->get_netvc())) &&
(!(match_style & TS_SERVER_SESSION_SHARING_MATCH_MASK_CERT) || validate_cert(sm, first->get_netvc()))) {
zret = HSM_DONE;
auto iter = m_fqdn_pool.find(hostname_hash);
Contributor

I noticed this iter variable is re-declared below. This is a nitpick, but it makes it harder to reason about the scope of this variable with the other one introduced.

Contributor Author

@traeak traeak Jan 30, 2024

There are two different scopes, and the iterators are different types (perhaps drop the auto and make the types explicit?). I mostly just renamed "first" to "iter" and moved the confusing zret logic down to the bottom of the function.

The changes to this function don't contribute to the PR itself; they're just cleanup.
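
For readers following the naming discussion, here is a small illustration of the same pattern: one name, `iter`, declared in two separate scopes with two different iterator types, spelled out explicitly instead of using `auto`. The `Pools` struct, its members, and `lookup` are hypothetical and are not the session manager's actual data structures.

```cpp
#include <map>
#include <string>

// Hypothetical container pair; not the ATS session pool layout.
struct Pools {
  std::map<std::string, int> by_ip;        // keyed by address string
  std::multimap<int, int> by_host_hash;    // keyed by a hostname hash

  // Returns the first matching value, or -1 if neither table has an entry.
  int lookup(const std::string &ip, int host_hash) const {
    {
      // First scope: iterator over the address table.
      std::map<std::string, int>::const_iterator iter = by_ip.find(ip);
      if (iter != by_ip.end()) {
        return iter->second;
      }
    }
    {
      // Second scope: same name, different iterator type.
      std::multimap<int, int>::const_iterator iter = by_host_hash.find(host_hash);
      if (iter != by_host_hash.end()) {
        return iter->second;
      }
    }
    return -1;
  }
};
```

Writing the types out makes it visible at a glance that the two `iter` variables are unrelated, at the cost of longer declarations.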

Member

@shinrich shinrich left a comment

The logic looks clean to me.

@traeak traeak merged commit 94f366d into apache:master Feb 1, 2024
phongn pushed a commit to phongn/trafficserver that referenced this pull request Feb 1, 2024
* fix race condition with session thread migration

* document why event_loop could be null

* do less work if per thread session pool

* remove extra unnecessary do_io_write call in critical section
traeak added a commit to traeak/trafficserver that referenced this pull request Feb 27, 2024
traeak added a commit that referenced this pull request Feb 28, 2024
cmcfarlen pushed a commit that referenced this pull request Mar 1, 2024
Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

ats92/master crash

3 participants