zsxwing commented Sep 20, 2016

What changes were proposed in this pull request?

Backport #13513 to branch 2.0.

How was this patch tested?

Jenkins

…ataLog in FileStreamSource

The current `metadataLog` in `FileStreamSource` adds a checkpoint file for each batch but has no way to remove or compact old files, which leads to a large number of small files when a query runs for a long time. This patch proposes compacting the old logs into one file. The approach is quite similar to `FileStreamSinkLog` but simpler.
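To illustrate the idea of compacting per-batch logs, here is a minimal in-memory sketch. It is not the code from this patch: the class name `CompactingLogSketch`, its methods, and the rule that every `compactInterval`-th batch triggers compaction are all assumptions made for illustration. It also deletes old files immediately, whereas a real implementation would likely wait for a cleanup delay so that concurrent readers are not broken.

```scala
import scala.collection.immutable.SortedMap

// Hypothetical in-memory model of a compacting metadata log.
// Each map entry stands in for one small log file on disk.
final class CompactingLogSketch(compactInterval: Int) {
  require(compactInterval > 0, "compactInterval must be positive")

  // batchId -> entries stored in that batch's log file, ordered by batchId
  private var files: SortedMap[Long, Seq[String]] = SortedMap.empty

  // Assumed rule: every compactInterval-th batch is a compaction batch.
  def isCompactionBatch(batchId: Long): Boolean =
    (batchId + 1) % compactInterval == 0

  def add(batchId: Long, entries: Seq[String]): Unit = {
    if (isCompactionBatch(batchId)) {
      // Fold every earlier log into this batch's file, then drop the old files.
      val merged = files.values.flatten.toSeq ++ entries
      files = SortedMap(batchId -> merged)
    } else {
      files = files + (batchId -> entries)
    }
  }

  // Number of log files that currently exist.
  def fileCount: Int = files.size

  // All entries, read back in batch order.
  def allEntries: Seq[String] = files.values.flatten.toSeq
}
```

With `compactInterval = 3`, adding batches 0 through 4 leaves three files (the compacted batch 2 plus batches 3 and 4) instead of five, while every entry remains readable.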

Unit test added.

Author: jerryshao <sshao@hortonworks.com>

Closes #13513 from jerryshao/SPARK-15698.

(cherry picked from commit a6aade0)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
* old files while another one keeps retrying. Setting a reasonable cleanup delay could avoid it.
*/
private val fileCleanupDelayMs = sparkSession.conf.get(SQLConf.FILE_SINK_LOG_CLEANUP_DELAY)
protected override val fileCleanupDelayMs =

Just resolved conflicts for these 3 confs

SparkQA commented Sep 20, 2016

Test build #65673 has finished for PR 15163 at commit 346def7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class CompactibleFileStreamLog[T: ClassTag](
    • class FileStreamSinkLog(
    • case class FileEntry(path: String, timestamp: Timestamp, batchId: Long = NOT_SET)
    • class FileStreamSourceLog(

zsxwing commented Sep 20, 2016

Merging to 2.0 since tests passed.

asfgit pushed a commit that referenced this pull request Sep 20, 2016
…ataLog in FileStreamSource (branch-2.0)

## What changes were proposed in this pull request?

Backport #13513 to branch 2.0.

## How was this patch tested?

Jenkins

Author: jerryshao <sshao@hortonworks.com>

Closes #15163 from zsxwing/SPARK-15698-spark-2.0.
@zsxwing zsxwing closed this Sep 20, 2016
@zsxwing zsxwing deleted the SPARK-15698-spark-2.0 branch January 4, 2017 20:00