Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tls_mgm: don't free anything in mod_destroy() #3269

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jes
Copy link
Contributor

@jes jes commented Dec 11, 2023

Summary
We were observing a crash on shutdown caused by a double-free of the TLS "domains" in tls_mgm. Not massively important because it only happens on shutdown, but still worth fixing.

Details

The issue was that the tls_mgm module was unloaded (free'ing all of its "domains") before the connections were destroyed. This left connections with dangling pointers to the domains, and when the connections are cleaned up, if the reference count is 0, tls_mgm can then try to free the domain again, causing a crash.

Solution

The solution is to defer destroy_modules() until after tcp_destroy().

Compatibility

It is possible that there are other related bugs that could be solved in the same way (i.e. that destroy_modules() should move even further down) but I've tried to be conservative.

Closing issues

Copy link

Any updates here? No progress has been made in the last 30 days, marking as stale.

@github-actions github-actions bot added the stale label Jan 11, 2024
@jes
Copy link
Contributor Author

jes commented Jan 11, 2024

No updates here.

@stale stale bot removed the stale label Jan 11, 2024
Copy link

Any updates here? No progress has been made in the last 30 days, marking as stale.

@github-actions github-actions bot added the stale label Feb 11, 2024
@jes
Copy link
Contributor Author

jes commented Feb 11, 2024

No updates here.

@stale stale bot removed the stale label Feb 11, 2024
@bogdan-iancu
Copy link
Member

@jes , even if I agree with the issue, I do not agree with the solution. You cannot destroy infrastructure resources before giving a chance to the modules to do a proper shutdown - for an example a module may require a TCP conn for its shutdown - this is a theoretical example, I'm not sure if there is such module, not even if you actually can use the TCP layer in the shutdown sequence - but let's check this first.
PS: an realistic example here would be the uac_registrant module doing UN-register upon shutdown.

@bogdan-iancu bogdan-iancu self-assigned this Feb 13, 2024
@jes
Copy link
Contributor Author

jes commented Feb 13, 2024

OK, good point. Then what about making tls_mgm just not free its domains when it's shutting down? If it's shutting down anyway then it won't really "leak" any memory because OpenSIPS is about to exit anyway? (Plus, they get free'd eventually anyway when the TCP connections are cleaned up ;) ).

If you think that might be OK then I'll update the PR.

@bogdan-iancu
Copy link
Member

yeah, indeed, it is a "bit" broken for the tls_mgm module to trash its data while this is still in use by ongoing connections. So, a quick workaround here will be for the tls_mgm module NOT to free anything that may be used by the connections.

This fixes a possible double-free during shutdown.

The issue was that the `tls_mgm` module was unloaded (free'ing all of
its "domains") before the connections were destroyed. This left
connections with dangling pointers to the domains, and when the
connections are cleaned up, if the reference count is 0, `tls_mgm`
can then try to free the domain again, causing a crash.
@jes jes force-pushed the jes/double-free-on-shutdown branch from 5e78cb7 to f1838f2 Compare February 16, 2024 20:59
@jes
Copy link
Contributor Author

jes commented Feb 16, 2024

@bogdan-iancu thanks, I've force-pushed a commit that does that, and I'll change the title of the pull request.

@jes jes changed the title Destroy TCP connections before modules tls_mgm: don't free anything in mod_destroy() Feb 16, 2024
@bogdan-iancu
Copy link
Member

@jes , revisiting a bit this issue and doing some brainstorming with @liviuchircu and @razvancrainea , we got to the conclusion that the code, as it is right now, should work ok.
In mod_destroy, there are multiple free operations done. The tls_domains are freed via the while cycles, but based on the ref counting, with tls_free_domain - so, if a TLS domain is still used from a connection, it will get its ref cnt decremented, but it should not get to 0 and be freed. The following free ops are non-critical, just maps used for searching the tls domains, so not to be used so far.
Do you have some info on how you get a TLS domain freed in mod_destroy() (so with ref cnt 0), but still used by a connection???

@bogdan-iancu
Copy link
Member

@jes , There was a similar report to yours, see #3338 . So, trying to see how to investigate this further (as as per my prev comment, the free'ing should be controlled by refcnt, so it should be ok) - can you somehow (even if randomly) reproduce this crash? I can eventually create a quick patch for troubleshooting those refcnt's

@jes
Copy link
Contributor Author

jes commented Apr 11, 2024

The easiest way for me to reproduce it was to put a deliberate segfault in uac_registrant so that as soon as it starts to try to register a user the worker process segfaults and then OpenSIPS tries to shutdown. This results in 2 core dumps (one for the crashed worker, and one for the double free).

I could be wrong but if you have TLS domains setup I don't think there is any way to get OpenSIPS to try to shutdown without triggering the double free.

But maybe it only gets detected if you are using the debug allocator?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants