Conversation

@JoshRosen JoshRosen commented Jul 7, 2022

What changes were proposed in this pull request?

This patch aims to reduce the memory overhead of TransportCipher$EncryptedMessage. In the current code, the EncryptedMessage constructor eagerly initializes a ByteArrayWritableChannel byteRawChannel, which consumes ~32 KB of memory per instance. If there are many EncryptedMessage instances on the heap (e.g. because there is a long queue of outgoing messages on a channel) then this overhead adds up and can cause OOMs or GC problems.

SPARK-24801 / #21811 fixed a similar issue in SaslEncryption. There, the fix was to lazily initialize the buffer: the buffer isn't actually accessed before transferTo() is called (and is only used there), so lazily initializing it there reduces memory requirements for queued outgoing messages.
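
As an illustration of that lazy-initialization pattern, here is a minimal Java sketch; the class name and the transferTo body are hypothetical stand-ins, not the actual Spark code:

import java.io.IOException;
import java.nio.channels.WritableByteChannel;
import org.apache.spark.network.util.ByteArrayWritableChannel;

// Hypothetical sketch: allocate the buffer on first use rather than in the
// constructor, so queued-but-unsent messages stay small on the heap.
class LazilyBufferedMessage {
  private ByteArrayWritableChannel byteChannel;  // null until first transferTo()

  public long transferTo(WritableByteChannel target, long position) throws IOException {
    if (byteChannel == null) {
      // The ~32 KB buffer is created only when the message is actually written.
      byteChannel = new ByteArrayWritableChannel(32 * 1024);
    }
    // ... copy/encrypt data into byteChannel, then write it to target ...
    return 0;
  }
}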

In principle we could apply a similar lazy-initialization fix here. In this PR, however, I have taken a different approach: I construct a single shared ByteArrayWritableChannel byteRawChannel in TransportCipher$EncryptionHandler and pass that shared instance to the EncryptedMessage constructor. I believe this is safe because we already do the same thing for the byteEncChannel channel buffer: that shared byteEncChannel gets reset() when EncryptedMessage.deallocate() is called. If the existing sharing is correct, then it should be equally safe to share the byteRawChannel buffer, because its scope of use and lifecycle are similar.
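
To make that intended lifecycle concrete, here is a minimal Java sketch of the shared-buffer approach; the class and method names are simplified illustrations of the idea, not the actual patch:

import org.apache.spark.network.util.ByteArrayWritableChannel;

// Hypothetical sketch: one buffer per handler, shared by every outgoing message.
class EncryptionHandlerSketch {
  private final ByteArrayWritableChannel byteRawChannel =
      new ByteArrayWritableChannel(32 * 1024);

  EncryptedMessageSketch createMessage(Object payload) {
    // Each queued message reuses the handler's buffer instead of allocating its own.
    return new EncryptedMessageSketch(payload, byteRawChannel);
  }
}

class EncryptedMessageSketch {
  private final Object payload;
  private final ByteArrayWritableChannel byteRawChannel;

  EncryptedMessageSketch(Object payload, ByteArrayWritableChannel byteRawChannel) {
    this.payload = payload;
    this.byteRawChannel = byteRawChannel;
  }

  // Messages on a channel are written and deallocated one at a time, so
  // resetting the shared buffer here makes it safe for the next message.
  void deallocate() {
    byteRawChannel.reset();
  }
}

The key assumption, as with the existing byteEncChannel sharing, is that messages on a single channel are written and deallocated serially, so no two messages ever use the buffer concurrently.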

Why are the changes needed?

Improve performance and reduce a source of OOMs when encryption is enabled.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Correctness: Existing unit tests.

Performance: I observed memory-usage and performance improvements by running an artificial workload that heavily stresses the shuffle send path. On a two-host Spark Standalone cluster where each host ran an external shuffle service (with a 1 GB heap) and a 64-core executor, I ran the following code:

val numMapTasks = 25000
val numReduceTasks = 256
val random = new java.util.Random()
// Each of the 25,000 map tasks emits ~10 KB random-byte records, producing a
// very large number of shuffle blocks to send through the shuffle service.
val inputData = spark.range(1, numMapTasks * numReduceTasks, 1, numMapTasks).map { x =>
  val bytes = new Array[Byte](10 * 1024)
  random.nextBytes(bytes)
  bytes
}
// The noop sink exercises the full shuffle write/fetch path without output I/O.
inputData.repartition(numReduceTasks).write.mode("overwrite").format("noop").save()

Prior to this patch, this job reliably failed because the Worker (where the shuffle service runs) would fill its heap and go into long GC pauses, eventually causing it to become disassociated from the Master. After this patch's changes, the job runs smoothly to completion.

@JoshRosen JoshRosen requested review from vanzin and zsxwing July 7, 2022 01:32
@github-actions github-actions bot added the CORE label Jul 7, 2022
@JoshRosen JoshRosen requested a review from felixcheung July 7, 2022 02:02
@jiangxb1987 jiangxb1987 left a comment

Thanks for fixing this issue!

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, @JoshRosen and @jiangxb1987 .
Merged to master for Apache Spark 3.4.0.
