
Conversation

@misutoth (Contributor) commented Sep 4, 2018

What changes were proposed in this pull request?

Reproduce the file sink duplication in a driver-failure scenario to help understand the situation.
Propose a new StagingFileCommitProtocol that creates the target files in a staging subdirectory and, upon job commit, moves all the files to the target directory. A task's target file names are the same across two different runs. This eliminates a potential source of duplication (the same content being placed into two files with different names).

How was this patch tested?

Created a specific unit test for the new protocol: StagingFileCommitProtocolSuite.
Ran the test that reproduces the problem: FileStreamSinkUnitSuite.
Made FileStreamStressSuite stricter, demanding exactly-once delivery.
Tested on a 4-machine cluster, sending 30000 messages 20 times while killing the driver. Each message was delivered exactly once.
Ran the sql tests with sbt.

@AmplabJenkins commented

Can one of the admins verify this patch?

@misutoth misutoth changed the title Tests for idempotency of FileStreamSink - Work in Progress [SPARK-25331][SS][WIP] Tests for exactly once guarantee of FileStreamSink Sep 4, 2018
@misutoth misutoth changed the title [SPARK-25331][SS][WIP] Tests for exactly once guarantee of FileStreamSink [SPARK-25331][SS] Make FileStreamSink ignore partitions of batches that have already been written to file system Sep 19, 2018
@misutoth (Contributor, Author) commented

@rxin could you please look into this change?

@misutoth (Contributor, Author) commented

@lw-lin, @marmbrus, in the meantime I found that you have discussed having deterministic file names in a PR. Could you please point me to those cases?

I was also wondering whether it is a reasonable expectation, from a sink's point of view, to receive the same data partitioned the same way when it is actually the same batch.

@gaborgsomogyi, you may also be interested in this change.

@gaborgsomogyi (Contributor) commented

I've taken a look, and I think the issue is solved in the mentioned PR but not yet documented. If somebody wants to use the output directory of a Spark application that uses a file sink (with exactly-once semantics), they must first read the metadata to get the list of valid files.

Considering this, the PR can be closed.

@misutoth (Contributor, Author) commented Dec 7, 2018

So I will consider that the recommended way to read a file sink's output. If there is a need to include the protocol from this PR as an alternative, we can still reopen it.

@misutoth misutoth closed this Dec 7, 2018