@chaoqin-li1123
Contributor

What changes were proposed in this pull request?

When a RocksDB version.zip file gets overwritten (e.g. by concurrent task execution or task/stage/batch reattempts), or the zip file is not uploaded successfully, the associated SST and log files are never garbage collected. We can clean up these files during periodic state store maintenance. The major concern is that SST files for the ongoing version also appear to be orphans, because they are uploaded before the zip file, so we have to be careful not to delete them. To be conservative, we only delete orphan files that are older than all files tracked by the current metadata, and only when there are at least 2 versions.
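The selection rule described above can be sketched as follows. This is a minimal illustration in Python, not the Scala code in this patch: `RemoteFile`, `find_deletable_orphans`, and the `modified_ms` field are hypothetical names, and the sketch assumes the file listing and the set of tracked file names have already been collected from the checkpoint directory and the version metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemoteFile:
    """Hypothetical stand-in for an SST or log file in the checkpoint directory."""
    name: str
    modified_ms: int  # last-modified timestamp in milliseconds


def find_deletable_orphans(all_files, tracked_names, num_versions):
    """Return the names of orphan files that are safe to delete.

    A file is an orphan if no version metadata tracks it. To stay
    conservative, an orphan is deletable only if it is older than every
    tracked file, and only when at least 2 versions exist.
    """
    if num_versions < 2:
        return []
    tracked = [f for f in all_files if f.name in tracked_names]
    if not tracked:
        # Nothing to compare against, so delete nothing.
        return []
    oldest_tracked_ms = min(f.modified_ms for f in tracked)
    return sorted(
        f.name
        for f in all_files
        if f.name not in tracked_names and f.modified_ms < oldest_tracked_ms
    )


# Example: a.sst is an old orphan; d.sst is newer than all tracked files,
# so it may belong to a version whose zip file has not been uploaded yet.
files = [
    RemoteFile("a.sst", 100),
    RemoteFile("b.sst", 200),
    RemoteFile("c.sst", 300),
    RemoteFile("d.sst", 400),
]
print(find_deletable_orphans(files, {"b.sst", "c.sst"}, num_versions=2))
```

Under these assumptions, the files uploaded for an in-flight version are retained because they are newer than the oldest tracked file, while stale orphans left over from overwritten or failed uploads are selected for deletion.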

Why are the changes needed?

When a RocksDB version.zip file gets overwritten (e.g. by concurrent task execution or task/stage/batch reattempts), or the zip file is not uploaded successfully, the associated SST and log files are never garbage collected (see https://github.com/databricks/runtime/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBFileManager.scala#L305-L309). These files consume storage.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a unit test.

Contributor

@HeartSaVioR HeartSaVioR left a comment

Only nits. Thanks! Shall we remove WIP from the PR title, as it seems to be ready to be reviewed and merged?

@chaoqin-li1123
Contributor Author

@rangadi

@chaoqin-li1123 chaoqin-li1123 changed the title from "[WIP][SPARK-42353][SS] Cleanup orphan sst and log files in RocksDB checkpoint directory" to "[SPARK-42353][SS] Cleanup orphan sst and log files in RocksDB checkpoint directory" on Feb 9, 2023
Contributor

@HeartSaVioR HeartSaVioR left a comment

+1 pending build.

@rangadi

rangadi commented Feb 9, 2023

+1.

@HeartSaVioR
Contributor

Thanks! Merging to master.

Kimahriman pushed a commit to Kimahriman/spark that referenced this pull request Aug 8, 2023
…int directory

Closes apache#39897 from chaoqin-li1123/orphan_cleanup.

Authored-by: Chaoqin Li <chaoqin.li@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>