Skip to content

Conversation

@attilapiros
Copy link
Contributor

What changes were proposed in this pull request?

Wrapping JHashMap[BlockId, BlockStatus] (used in blockStatusByShuffleService) into a new class BlockStatusPerBlockId which removes the reference to the map when all the persisted blocks are removed.

Why are the changes needed?

With #32790 a bug is introduced when all the persisted blocks are removed we remove the HashMap which already shared by the block manger infos but when new block is persisted this map is needed to be used again for storing the data (and this HashMap must be the same which shared by the block manger infos created for registered block managers running on the same host where the external shuffle service is).

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Extending BlockManagerInfoSuite with test which removes all the persisted blocks then adds another one.

@github-actions github-actions bot added the CORE label Jun 22, 2021
@attilapiros
Copy link
Contributor Author

cc @mridulm

@SparkQA
Copy link

SparkQA commented Jun 22, 2021

Test build #140133 has finished for PR 33020 at commit 3537e9c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44661/

@SparkQA
Copy link

SparkQA commented Jun 22, 2021

Test build #140135 has finished for PR 33020 at commit b78f726.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44661/

@attilapiros
Copy link
Contributor Author

attilapiros commented Jun 22, 2021

build error is unrelated:

[error] ## Exception when compiling 480 sources to
/home/jenkins/workspace/SparkPullRequestBuilder@2/sql/catalyst/target/scala-2.12/classes
[error] java.lang.StackOverflowError
[error] scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
[error] scala.tools.nsc.transform.CleanUp$CleanUpTransformer.transform(CleanUp.scala:607)
[error] scala.tools.nsc.transform.CleanUp$CleanUpTransformer.transform(CleanUp.scala:37)
[error] scala.reflect.api.Trees$Transformer.$anonfun$transformStats$1(Trees.scala:2597)
[error] scala.reflect.api.Trees$Transformer.transformStats(Trees.scala:2595)
[error] scala.reflect.internal.Trees.itransform(Trees.scala:1430)
[error] scala.reflect.internal.Trees.itransform$(Trees.scala:1400)

Jenkins retest this please

@SparkQA
Copy link

SparkQA commented Jun 22, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44662/

@SparkQA
Copy link

SparkQA commented Jun 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44665/

@SparkQA
Copy link

SparkQA commented Jun 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44665/

@gengliangwang
Copy link
Member

@attilapiros Thanks for coming up with a better fix.
cc @Ngone51 as well

@SparkQA
Copy link

SparkQA commented Jun 22, 2021

Test build #140138 has finished for PR 33020 at commit b78f726.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros attilapiros changed the title [WIP][SPARK-35543][CORE][FOLLOWUP] Fix memory leak in BlockManagerMasterEndpoint removeRdd [SPARK-35543][CORE][FOLLOWUP] Fix memory leak in BlockManagerMasterEndpoint removeRdd Jun 22, 2021
Copy link
Contributor

@mridulm mridulm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this @attilapiros !

@SparkQA
Copy link

SparkQA commented Jun 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44731/

@SparkQA
Copy link

SparkQA commented Jun 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44731/

@SparkQA
Copy link

SparkQA commented Jun 23, 2021

Test build #140203 has finished for PR 33020 at commit 9d8d357.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 0bdece0 Jun 24, 2021
@mridulm
Copy link
Contributor

mridulm commented Jun 24, 2021

Thanks for fixing this @attilapiros !
Thanks for the reviews @dongjoon-hyun, @Ngone51 :-)

srowen pushed a commit that referenced this pull request Apr 17, 2022
…ching

### What changes were proposed in this pull request?

Fixes a bug where if `spark.shuffle.service.fetch.rdd.enabled=true`, memory-only cached blocks will fail to unpersist.

### Why are the changes needed?

In #33020, when all RDD blocks are removed from `externalShuffleServiceBlockStatus`, the underlying Map is nulled to reduce memory. When persisting blocks we check if it's using disk before adding it to `externalShuffleServiceBlockStatus`, but when removing them there is no check, so a memory-only cache block will keep `externalShuffleServiceBlockStatus` null, and when unpersisting it throw an NPE because it tries to remove from the null Map. This adds checks to the removal as well to only remove if the block is on disk, and therefore should have been added to `externalShuffleServiceBlockStatus` in the first place.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New and updated UT

Closes #35959 from Kimahriman/fetch-rdd-memory-only-unpersist.

Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
srowen pushed a commit that referenced this pull request Apr 17, 2022
…ching

### What changes were proposed in this pull request?

Fixes a bug where if `spark.shuffle.service.fetch.rdd.enabled=true`, memory-only cached blocks will fail to unpersist.

### Why are the changes needed?

In #33020, when all RDD blocks are removed from `externalShuffleServiceBlockStatus`, the underlying Map is nulled to reduce memory. When persisting blocks we check if it's using disk before adding it to `externalShuffleServiceBlockStatus`, but when removing them there is no check, so a memory-only cache block will keep `externalShuffleServiceBlockStatus` null, and when unpersisting it throw an NPE because it tries to remove from the null Map. This adds checks to the removal as well to only remove if the block is on disk, and therefore should have been added to `externalShuffleServiceBlockStatus` in the first place.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New and updated UT

Closes #35959 from Kimahriman/fetch-rdd-memory-only-unpersist.

Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit e0939f0)
Signed-off-by: Sean Owen <srowen@gmail.com>
Kimahriman added a commit to Kimahriman/spark that referenced this pull request Apr 10, 2023
…ching

### What changes were proposed in this pull request?

Fixes a bug where if `spark.shuffle.service.fetch.rdd.enabled=true`, memory-only cached blocks will fail to unpersist.

### Why are the changes needed?

In apache#33020, when all RDD blocks are removed from `externalShuffleServiceBlockStatus`, the underlying Map is nulled to reduce memory. When persisting blocks we check if it's using disk before adding it to `externalShuffleServiceBlockStatus`, but when removing them there is no check, so a memory-only cache block will keep `externalShuffleServiceBlockStatus` null, and when unpersisting it throw an NPE because it tries to remove from the null Map. This adds checks to the removal as well to only remove if the block is on disk, and therefore should have been added to `externalShuffleServiceBlockStatus` in the first place.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New and updated UT

Closes apache#35959 from Kimahriman/fetch-rdd-memory-only-unpersist.

Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit e0939f0)
Signed-off-by: Sean Owen <srowen@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants