
Conversation

@misutoth (Contributor) commented Sep 4, 2018

What changes were proposed in this pull request?

Reproduce the file sink duplication in a driver-failure scenario to help understand the situation.
Propose a new StagingFileCommitProtocol that creates the target files in a staging subdirectory and, upon job commit, moves all the files to the target directory. A task's target file names are the same across two different runs. This eliminates a potential source of duplication (the same content being placed into two files with different names).

How was this patch tested?

Created a specific unit test for the new protocol: StagingFileCommitProtocolSuite.
Ran the test that reproduces the problem: FileStreamSinkUnitSuite.
Made FileStreamStressSuite stricter, demanding exactly-once delivery.
Tested on a 4-machine cluster, sending 30000 messages 20 times while killing the driver. Each message was delivered exactly once.
Ran the sql tests with sbt.

@AmplabJenkins commented

Can one of the admins verify this patch?

@misutoth misutoth changed the title Tests for idempotency of FileStreamSink - Work in Progress [SPARK-25331][SS][WIP] Tests for exactly once guarantee of FileStreamSink Sep 4, 2018
@misutoth misutoth changed the title [SPARK-25331][SS][WIP] Tests for exactly once guarantee of FileStreamSink [SPARK-25331][SS] Make FileStreamSink ignore partitions of batches that have already been written to file system Sep 19, 2018
@misutoth (Contributor, Author) commented

@rxin could you please look into this change?

@misutoth (Contributor, Author) commented

@lw-lin, @marmbrus, in the meantime I found that you have discussed having deterministic file names in a PR. Could you please point me to those cases?

I was also wondering whether it is a reasonable expectation, from a sink's point of view, to receive the same data partitioned the same way when it is actually the same batch.

@gaborgsomogyi, you may also be interested in this change.

@gaborgsomogyi (Contributor) commented

I've taken a look, and I think the issue is solved in the mentioned PR but not yet documented. If somebody wants to use the output directory of a Spark application that uses a file sink (with exactly-once semantics), they must first read the metadata to get the list of valid files.

Considering this, the PR can be closed.

@misutoth (Contributor, Author) commented Dec 7, 2018

So I will consider that the recommended way to read a file sink's output. If there is a need to include the protocol from this PR as an alternative, we can still reopen it.

@misutoth misutoth closed this Dec 7, 2018