async_tcp_client: remove callbacks if connection was not closed #35410
Conversation
Signed-off-by: Ohad Vano <ohadvano@gmail.com>
Signed-off-by: Ohad Vano <ohadvano@gmail.com>
/assign @botengyao
Overall LGTM, and thanks for the fix!
Do you mind also adding an integration test to tcp_async_client_integration_test.cc?
I don't think this is testable in integration tests, but if you find a way to, please lmk.
The existing TCP async client integration test uses a fake filter, and the lifetime of the async client in the test filter is the same as the downstream connection, which has the longest lifetime. To simulate this scenario, we would need to create a customized usage that owns this async client and destroys the object first. I think this can be simulated by unit tests with control over the lifetime.
Are you referring to the close() and cleanup() sections? If so, that is not enough. That would just do a regular close of the upstream connection, and there won't be a crash.
The changes I made to the unit tests will test this. Without the fix, many of the unit tests in the file will fail (those that don't explicitly call close()).
> Are you referring to the close() and cleanup() sections
>
> To really simulate the issue, the AsyncTcpClient itself needs to be deleted before FakeUpstream is being destroyed, while still having an active upstream connection
Nope; like you described, we can establish an object that uses this async client and manually destroy it first.
LGTM, and deferred to @ggreenway for another pass.
/assign @ggreenway
Signed-off-by: Ohad Vano <ohadvano@gmail.com>
```cpp
connection_->removeConnectionCallbacks(*this);
}

close(Network::ConnectionCloseType::NoFlush);
```
Could you elaborate more on why the close() is added here? I don't see it in your initial fix, and I feel a user may want to close the connection with another close type.
The reason is that if we don't close it, tests fail because of the ASSERT in
envoy/source/common/network/connection_impl.cc, lines 116 to 125 at b1e3351:
```cpp
ConnectionImpl::~ConnectionImpl() {
  ASSERT(!socket_->isOpen() && delayed_close_timer_ == nullptr,
         "ConnectionImpl was unexpectedly torn down without being closed.");
  // In general we assume that owning code has called close() previously to the destructor being
  // run. This generally must be done so that callbacks run in the correct context (vs. deferred
  // deletion). Hence the assert above. However, call close() here just to be completely sure that
  // the fd is closed and make it more likely that we crash from a bad close callback.
  close(ConnectionCloseType::NoFlush);
}
```
The comment in that code section explains the expectation of the owner of the ConnectionImpl:

> In general we assume that owning code has called close() previously to the destructor being run.
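To make this ownership expectation concrete, here is a minimal, self-contained sketch of the destructor-time cleanup being discussed. The `Connection`/`ConnectionCallbacks` types are simplified stand-ins for illustration, not Envoy's real classes:

```cpp
#include <memory>

// Stand-in for the connection callbacks interface.
struct ConnectionCallbacks {
  virtual ~ConnectionCallbacks() = default;
  virtual void onEvent() = 0;
};

// Stand-in for a client connection: closing it (including from its own
// destructor) notifies whatever callbacks are still registered.
struct Connection {
  void addConnectionCallbacks(ConnectionCallbacks& cb) { callbacks_ = &cb; }
  void removeConnectionCallbacks(ConnectionCallbacks&) { callbacks_ = nullptr; }
  void close() {
    if (open_) {
      open_ = false;
      if (callbacks_ != nullptr) {
        callbacks_->onEvent();
      }
    }
  }
  ~Connection() { close(); }

  ConnectionCallbacks* callbacks_ = nullptr;
  bool open_ = true;
};

// Stand-in for the async TCP client that owns the connection.
class AsyncClientSketch : public ConnectionCallbacks {
public:
  void connect() {
    connection_ = std::make_unique<Connection>();
    connection_->addConnectionCallbacks(*this);
  }

  ~AsyncClientSketch() override {
    if (connection_ != nullptr) {
      // Deregister first so tearing the connection down cannot call back into
      // this object while it is being destroyed, then close explicitly so the
      // connection is never destroyed while still open (the ASSERT above).
      connection_->removeConnectionCallbacks(*this);
      connection_->close();
    }
  }

  void onEvent() override { /* would touch members of the client */ }

private:
  std::unique_ptr<Connection> connection_;
};

int main() {
  AsyncClientSketch client;
  client.connect();
  // `client` goes out of scope with an active connection; the destructor above
  // removes the callbacks and closes before the connection is deleted.
}
```

The important part is the ordering in the destructor: callbacks are removed before anything can raise the close event, and the connection is closed before it is destroyed.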
I feel this is to mitigate the wrong usage of this client in the test filter to simulate the callback issue, and ideally this should not happen; it needs to be fixed on the usage side. FWIW, the original version of the fix LGTM, and we can change how we test it or remove the newly added tests; either works for me.
The changes differ in the close call. Is there a reason why we shouldn't include it, regardless of the testing?
Removing the newly added test means no integration test coverage for this case.
I think the reason you see the non-close error is that we force-release the async client before calling any close, which is the wrong usage, and that is also what the assertion there is about. If we really want to keep the close here to cover this should-not-happen case, adding a close boolean may be less risky. I am more worried that if a caller calls close(FlushWriteAndDelay) and then follows with a close(NoFlush) while the connection is still in the deferred deletion list, the previous one will be overridden.
Thanks for fixing this!
/wait
/retest
Thanks, this looks neat; overall LGTM modulo one concern for the recent change.
@botengyao thanks for the review. I think what you're saying makes sense. My suggestion is adding a boolean that records whether close() was already called and checking it in the destructor, so the condition would only force a close if the client was not already closed.
@ohadvano, sgtm, a close boolean works, thanks!
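For illustration, here is a minimal sketch of the agreed direction, assuming a boolean member that records whether close() was already requested; the names are made up for this example rather than taken from the upstream code:

```cpp
// Stand-in for Network::ConnectionCloseType.
enum class CloseType { NoFlush, FlushWrite, FlushWriteAndDelay };

class AsyncClientSketch {
public:
  void close(CloseType type) {
    if (!closed_) {
      closed_ = true;
      // Forward the caller's requested close type to the underlying
      // connection here (omitted in this sketch).
      (void)type;
    }
  }

  ~AsyncClientSketch() {
    // Only force a close from the destructor when nobody closed the client
    // yet; this keeps the destructor's NoFlush from overriding an earlier
    // FlushWriteAndDelay requested by the caller.
    if (!closed_) {
      close(CloseType::NoFlush);
    }
  }

private:
  bool closed_ = false;
};

int main() {
  AsyncClientSketch client;
  client.close(CloseType::FlushWriteAndDelay); // the caller's graceful close wins
  // The destructor runs here and does nothing because closed_ is already set.
}
```

The design point is that the destructor only falls back to NoFlush when the caller never closed the client, which addresses the override concern above.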
Signed-off-by: Ohad Vano <ohadvano@gmail.com>
ping @ggreenway @botengyao
LGTM, thanks!
LGTM
…yproxy#35410)

When the ``AsyncTcpClient`` is being destroyed but it also has an active client connection, there's a crash: during the instance destruction, the ``ClientConnection`` object would also be destroyed, causing ``raiseEvent`` to be called back into ``AsyncTcpClient`` while it is being destroyed.

Caught with the following stack trace:

```
Caught Segmentation fault, suspect faulting address 0x0
Backtrace (use tools/stack_decode.py to get line numbers):
Envoy version: ee8c765a07037033766ea556c032120b497152b3/1.27.0/Clean/RELEASE/BoringSSL
#0: __restore_rt [0x7d80ab903420]
#1: Envoy::Extensions::AccessLoggers::Fluentd::FluentdAccessLoggerImpl::onEvent() [0x58313528746b]
#2: Envoy::Tcp::AsyncTcpClientImpl::onEvent() [0x5831359da00a]
#3: Envoy::Network::ConnectionImplBase::raiseConnectionEvent() [0x583135f0521d]
#4: Envoy::Network::ConnectionImpl::raiseEvent() [0x583135e9fed9]
#5: Envoy::Network::ConnectionImpl::closeSocket() [0x583135e9f90c]
#6: Envoy::Network::ConnectionImpl::close() [0x583135e9e54c]
#7: Envoy::Network::ConnectionImpl::~ConnectionImpl() [0x583135e9de5c]
#8: Envoy::Network::ClientConnectionImpl::~ClientConnectionImpl() [0x5831355fd25e]
#9: Envoy::Tcp::AsyncTcpClientImpl::~AsyncTcpClientImpl() [0x5831359da247]
#10: Envoy::Extensions::AccessLoggers::Fluentd::FluentdAccessLoggerImpl::~FluentdAccessLoggerImpl() [0x583135289350]
#11: Envoy::Extensions::AccessLoggers::Fluentd::FluentdAccessLog::ThreadLocalLogger::~ThreadLocalLogger() [0x58313528adbf]
#12: Envoy::ThreadLocal::InstanceImpl::shutdownThread() [0x58313560373a]
#13: Envoy::Server::WorkerImpl::threadRoutine() [0x583135630c0a]
#14: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::{lambda()#1}::__invoke() [0x5831364e88d5]
#15: start_thread [0x7d80ab8f7609]
```

Risk Level: low
Testing: unit tests
Docs Changes: none
Release Notes: none
Platform Specific Features: none

---------

Signed-off-by: Ohad Vano <ohadvano@gmail.com>
Signed-off-by: Martin Duke <martin.h.duke@gmail.com>
Commit Message: when the `AsyncTcpClient` is being destroyed but it also has an active client connection, there's a crash: during the instance destruction, the `ClientConnection` object would also be destroyed, causing `raiseEvent` to be called back into `AsyncTcpClient` while it is being destroyed.
Additional Description: Caught with the stack trace shown in the merged commit message above.
Risk Level: low
Testing: unit tests
Docs Changes: none
Release Notes: none
Platform Specific Features: none
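To make the destruction order in the backtrace concrete, here is a small, self-contained sketch (simplified stand-ins, not Envoy's classes) of how deleting an owner that still has an active connection re-enters its event callback while the owner is being torn down; in the real code that re-entry touches already-destroyed state and segfaults:

```cpp
#include <iostream>
#include <memory>

struct ConnectionCallbacks {
  virtual ~ConnectionCallbacks() = default;
  virtual void onEvent() = 0;
};

// Destroying an open connection raises an event on whatever callbacks are
// still registered (mirrors raiseEvent() being reached from closeSocket()).
struct Connection {
  explicit Connection(ConnectionCallbacks& cb) : callbacks_(&cb) {}
  ~Connection() {
    if (callbacks_ != nullptr) {
      callbacks_->onEvent();
    }
  }
  ConnectionCallbacks* callbacks_;
};

// Stand-in for an owner (like the Fluentd access logger in the backtrace)
// that owns the client connection and implements its callbacks.
class Owner : public ConnectionCallbacks {
public:
  Owner() : connection_(std::make_unique<Connection>(*this)) {}

  ~Owner() override {
    being_destroyed_ = true;
    // connection_ is destroyed right after this body finishes; because the
    // callbacks were never removed, that destruction calls onEvent() back
    // into this object while it is mid-destruction.
  }

  void onEvent() override {
    if (being_destroyed_) {
      std::cout << "onEvent() re-entered an object that is being destroyed\n";
      // In the real crash, this is where already-destroyed members get used.
    }
  }

private:
  bool being_destroyed_ = false;
  std::unique_ptr<Connection> connection_;
};

int main() {
  { Owner owner; } // destroyed while its connection is still active
}
```

The fix discussed in this PR removes the connection callbacks (and closes the connection) in the client's destructor, so this callback never fires into an object that is being destroyed.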