Fix for Remote port forwarding buffers can grow without limits (issue #658) #913
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR addresses #658.
Problem
The
Buffer
class can grow out of control when its consumer is unable to read data quickly enough to keep up with the producer for an extended period of time. This results in an out of memory (OOM) condition during ssh port forwarding. This issue has been hit in production at Delphix many times.Diagnosis
When
ChannelInputStream
receives new data from its producer, it callsBuffer.putRawBytes
which will double its internal array if there is not enough room to store data starting at its current write positionwpos
. Meanwhile, when the consumer reads data fromChannelInputStream
, the buffer will advance its read positionrpos
to the next unread byte. If, and only if, the read position catches up with the write position,ChannelInputStream
will callBuffer.clear
to reset both positions to0
.However, if for an extended amount of time the consumer is slower than the producer and the latter produces a large amount of data, the buffer will continue to double itself until it cannot fit in memory anymore, causing an OOM.
Solution
The key issue with this behavior of
Buffer
is that the amount of data betweenrpos
andwpos
can be relatively small compared to the space in the array beforerpos
, which is being wasted. Yet, as long asrpos
doesn't catch up withwpos
(i.e. as long asBuffer.clear
does not get called),Buffer
will continue to add new data afterwpos
and grow the array past its current end -- no matter how much space beforerpos
is available.This PR proposes to use a circular buffer instead, so that the space before
rpos
is no longer wasted and the buffer no longer needs to grow beyond any limits while utilizing only a fraction of the necessary space. Also, the circular buffer can grow just likeBuffer
when more capacity is needed, but now it throws an exception when a configurable size limit is reached instead of potentially triggering an OOM (a scenario that is a lot less likely anyway because the space beforerpos
is now being utilized).At first, I thought of refactoring the entire
Buffer
class to make it circular, but there are too many uses ofBuffer
in the code base, too many operations in the interface ofBuffer
whose implementation would need to change, and perhaps more critically too many instances of the internaldata
,rpos
andwpos
fields being exposed via getters to outside manipulation (e.g.PacketReader.readPacket
which callsBuffer.array
to get the internaldata
array) to make this practical and safe from side effects.My proposed solution instead is to focus on the
ChannelInputStream
use case ofBuffer
, which implements only a subset of allBuffer
methods, and create a newCircularBuffer
class that implements only whatChannelInputStream
needs, while solving the race condition that causes this OOM. As far as I can tell,ChannelInputStream
appears to be the only use ofBuffer
where this race condition can happen. We use SSHJ at Delphix for all SSH needs and we have only hit this issue in that use case. (If other parts of SSHJ turn out to be affected, we can always expandCircularBuffer
later to be used there as well.)Testing in Controlled Environment
In a controlled environment, the issue can be reproduced by placing a debugger breakpoint at
ChannelInputStream.read
to temporarily stop the consumption of data (which movesrpos
closer towpos
) whileChannelInputStream.receive
continues to receive new data (which moveswpos
farther fromrpos
and eventually forces buffer expansions).After this fix, we can see with this breakpoint that the underlying array is being fully utilized and that it doesn't get expanded until there is truly no more space in it.
Just for reference, I include in this diff the test
RemotePFPerformanceTest
which allowed me to reproduce the issue locally. It's disabled, however, and I might just remove it in the end, because of its heavy impact on the performance of the system running it.Testing in Production
In production, we hit this issue by using port forwarding to read a very large file on one end and write it to disk on the other end, usually when disk writes are slowed down by other processes also writing to it.
After applying this patch on top of SSHJ, we have been running the fix for almost three years at Delphix (including re-applying it on top of each new release of SSHJ). We haven't hit this issue again and we haven't experienced any negative side effects.