[SPARK-4835] Disable validateOutputSpecs for Spark Streaming jobs #3832
Conversation
Test build #24874 has started for PR 3832 at commit
Test build #24874 has finished for PR 3832 at commit

Test PASSed.
docs/configuration.md (Outdated)
Better to say "jobs generated through Spark Streaming's StreamingContext" rather than "scheduler" and stuff.
A few minor comments; almost good to go.

@pwendell You should take a look as well.
This method is not USED by Spark Streaming. It's used by Spark, but it allows higher-level frameworks (like Spark Streaming) to override the behavior. Also, this comment should not go into the details of why Spark Streaming updates it. Just refer to the JIRA, and the Spark Streaming scheduler's code change can do the explaining of the issue. This keeps Spark code from referring to too much of Spark Streaming.
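A minimal sketch of the override pattern being discussed, assuming the `DynamicVariable`-based hook that the final patch description mentions; the object and method names here are illustrative, not Spark's actual internals:

```scala
import scala.util.DynamicVariable

// Hypothetical hook: Spark's save path consults a DynamicVariable, and a
// higher-level framework (such as the streaming scheduler) overrides it
// for the jobs it submits.
object OutputSpecValidation {
  // Defaults to false: no bypass, so validation behaves as configured.
  val disableOutputSpecValidation = new DynamicVariable[Boolean](false)

  def isEnabled(conf: Map[String, String]): Boolean = {
    val confEnabled =
      conf.getOrElse("spark.hadoop.validateOutputSpecs", "true").toBoolean
    // Validation runs only when no framework has requested a bypass
    // and the user has not disabled it via configuration.
    !disableOutputSpecValidation.value && confEnabled
  }
}

// A higher-level framework wraps job submission like this; the override is
// scoped to the thunk and automatically restored afterwards, so no global
// state or SparkConf mutation is needed.
def submitStreamingJob(job: () => Unit): Unit = {
  OutputSpecValidation.disableOutputSpecValidation.withValue(true) {
    job()
  }
}
```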
I'd be glad to add a test here, although this might be a little tricky since the old behavior resulted in silent failures; I should be able to come up with a test, though. Regarding the streaming-specific

Which of these makes more sense? I think that option (2) is a better backwards-compatibility escape hatch / flag.
In (2), what is the default setting of the two params? Somehow I am finding (2) more confusing than (1). Why do you think (2) is better for backwards compatibility? The earlier behavior was actually buggy; that needs to be solved.
Since the old behavior was buggy, can we just omit the
So basically you want to split this PR into two: (1) fix the bug, and (2) introduce an override. I like the idea. Let's do that.
Test build #25003 has started for PR 3832 at commit
@tdas I've updated this PR and added a test case. My test case uses calls inside of a

I wasn't able to get the existing "recovery with saveAsNewAPIHadoopFiles operation" test to fail, though, even though I discovered this bug while refactoring that test in my other PR. I think that the issue is that the failed

I was about to write that this bug might not actually affect users who don't use
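A rough sketch of the kind of test being described, assuming a save action invoked from inside `transform` with a batch-time-derived output path; the source, paths, and app name below are placeholders, not the actual test code from this PR:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

val conf = new SparkConf().setMaster("local[2]").setAppName("OutputSpecTestSketch")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("/tmp/checkpoint") // placeholder path

val input = ssc.socketTextStream("localhost", 9999) // placeholder source
val transformed = input.transform { (rdd: RDD[String], time: Time) =>
  // The transform closure runs on the driver each time a batch's job is
  // generated, so this save action re-executes when the same batch is
  // re-run after checkpoint recovery. Deriving the path from the batch
  // time means the re-run targets the same directory, and with output
  // spec validation enabled the second attempt fails because the
  // directory already exists.
  rdd.saveAsTextFile(s"/tmp/out-${time.milliseconds}") // placeholder path
  rdd
}
transformed.print()
// ssc.start(); ssc.awaitTermination() // run, stop, and restart from the
// checkpoint to exercise the recovery path
```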
Test build #25003 has finished for PR 3832 at commit

Test PASSed.
Yes, this is hardly a valid use case of
The new test added here always failed before the fix in this patch. I agree that it's a bit convoluted; it would be better to catch this bug in the original
Alright, I've pushed one more commit that explains the confusing
Test build #25040 has started for PR 3832 at commit

Test build #25040 has finished for PR 3832 at commit

Test PASSed.
nit: extra space before However
There are a few nits, but this LGTM even without them. I am merging this. Thanks for finding and solving this!
This patch disables output spec. validation for jobs launched through Spark Streaming, since this interferes with checkpoint recovery.

Hadoop OutputFormats have a `checkOutputSpecs` method which performs certain checks prior to writing output, such as checking whether the output directory already exists. SPARK-1100 added checks for FileOutputFormat, SPARK-1677 (#947) added a SparkConf configuration to disable these checks, and SPARK-2309 (#1088) extended these checks to run for all OutputFormats, not just FileOutputFormat.

In Spark Streaming, we might have to re-process a batch during checkpoint recovery, so `save` actions may be called multiple times. In addition to `DStream`'s own save actions, users might use `transform` or `foreachRDD` and call the `RDD` and `PairRDD` save actions. When output spec. validation is enabled, the second calls to these actions will fail due to existing output.

This patch automatically disables output spec. validation for jobs submitted by the Spark Streaming scheduler. This is done by using Scala's `DynamicVariable` to propagate the bypass setting without having to mutate SparkConf or introduce a global variable.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #3832 from JoshRosen/SPARK-4835 and squashes the following commits:

36eaf35 [Josh Rosen] Add comment explaining use of transform() in test.
6485cf8 [Josh Rosen] Add test case in Streaming; fix bug for transform()
7b3e06a [Josh Rosen] Remove Streaming-specific setting to undo this change; update conf. guide
bf9094d [Josh Rosen] Revise disableOutputSpecValidation() comment to not refer to Spark Streaming.
e581d17 [Josh Rosen] Deduplicate isOutputSpecValidationEnabled logic.
762e473 [Josh Rosen] [SPARK-4835] Disable validateOutputSpecs for Spark Streaming jobs.

(cherry picked from commit 939ba1f)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
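A small illustration of the failure mode and the pre-existing global escape hatch; the paths and app name here are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("ValidateOutputSpecsDemo"))

val data = sc.parallelize(1 to 10)
data.saveAsTextFile("/tmp/demo-out")    // first write succeeds
// data.saveAsTextFile("/tmp/demo-out") // with validation on (the default),
//                                      // this fails because the output
//                                      // directory already exists

// The global escape hatch added by SPARK-1677:
//   new SparkConf().set("spark.hadoop.validateOutputSpecs", "false")
// This patch instead scopes the bypass to jobs submitted by the streaming
// scheduler, so batch re-execution during checkpoint recovery succeeds
// without users having to disable validation everywhere.
```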