[SPARK-5545][STREAMING] DStream#saveAs**Files can fail after app restart... #4322
…arts. This happens when the driver fails while an RDD has been only partially written. When the app restarts, it reuses the directory that the RDD was being written to. This can cause the write to fail, since the underlying MapReduce (MR) API does not allow writing to a directory that already exists. This PR introduces a new method in RDD.scala that allows overwriting any existing data in a directory. This method is private[spark] and is used by Spark Streaming to overwrite any data present in the directory.
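To make the failure mode concrete, here is a minimal, library-free sketch of the behaviour described above. It is not Spark or Hadoop code: `save_batch`, `simulate_restart`, the `overwrite` flag, and the `prefix-1000` directory name are all hypothetical names chosen for illustration. It only assumes that the save refuses to write into an existing directory and that the fix amounts to discarding the leftover data first.

```python
import shutil
import tempfile
from pathlib import Path


def save_batch(out_dir: Path, records, overwrite: bool = False) -> None:
    """Hypothetical stand-in for a batch save; this is not Spark's API."""
    if out_dir.exists():
        if not overwrite:
            # Mimics the "output directory already exists" check described above.
            raise FileExistsError(f"Output directory {out_dir} already exists")
        # Discard whatever a previous, interrupted attempt left behind.
        shutil.rmtree(out_dir)
    out_dir.mkdir(parents=True)
    for i, record in enumerate(records):
        (out_dir / f"part-{i:05d}").write_text(f"{record}\n")


def simulate_restart(overwrite: bool) -> str:
    """Leave a partially written batch directory, then retry the same batch."""
    with tempfile.TemporaryDirectory() as tmp:
        out_dir = Path(tmp) / "prefix-1000"
        # First attempt: the driver "fails" after one of two partitions is written.
        out_dir.mkdir()
        (out_dir / "part-00000").write_text("a\n")
        # After the restart the same batch is written to the same directory.
        try:
            save_batch(out_dir, ["a", "b"], overwrite=overwrite)
        except FileExistsError:
            return "failed"
        return ",".join(sorted(p.name for p in out_dir.iterdir()))
```

Under these assumptions, `simulate_restart(False)` reproduces the reported failure, while `simulate_restart(True)` replaces the partial output with a complete batch. Deleting the directory up front is only safe here because the retried batch is expected to regenerate all of its contents.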
Test build #26581 has started for PR 4322 at commit
Test build #26581 has finished for PR 4322 at commit
Test FAILed.
Test build #26582 has started for PR 4322 at commit
This seems similar to #3832 / https://issues.apache.org/jira/browse/SPARK-4835.
Test build #26582 has finished for PR 4322 at commit
Test FAILed.
Jenkins, test this please
Test build #26606 has started for PR 4322 at commit
Was already fixed by #3832
Test build #26606 has finished for PR 4322 at commit
Test PASSed.