[SPARK-19633][SS] FileSource read from FileSink #16987
Conversation
Test build #73124 has started for PR 16987 at commit

Jenkins retest this please

Test build #73126 has finished for PR 16987 at commit
this is to keep track of the file name for later checking
Thanks for working on this; however, I'm not sure we want to go with this approach. In Spark 2.2, I think we should consider deprecating the manifest files and instead use deterministic file names to get exactly-once semantics.
Using deterministic file names sounds great. Thanks! I'm closing this.
I spoke too soon, sorry! Thinking about it more, the deterministic-filename solution is not great, as the number of partitions could change for several reasons. Given that, would you mind reopening this? /cc @zsxwing do you have time to review?
Reopening :-)
Test build #73374 has finished for PR 16987 at commit

retest this please

Test build #73382 has finished for PR 16987 at commit
zsxwing left a comment
Overall looks good. Could you rewrite the tests to use real streaming queries rather than modifying the log manually? It's better to have two queries: one writing to the FileSink, the other reading from the same folder using the FileSource.
I guess sourceHasMetadata is generated here because of hasMetadata. Could you move hasMetadata to object FileStreamSink? Then you can do it inside FileStreamSource.
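The suggested refactor might look roughly like the following minimal sketch. It is an assumption-laden model, not Spark's actual code: the real `FileStreamSink.hasMetadata` takes a Hadoop `Configuration` and works with Hadoop paths, whereas `java.nio.file` stands in here so the example is self-contained.

```scala
import java.nio.file.{Files, Paths}

// Hypothetical simplified stand-in for Spark's FileStreamSink object;
// only the tri-part shape of the check (single path, non-glob, metadata
// directory exists) is modeled here.
object FileStreamSink {
  val metadataDir = "_spark_metadata"

  // True only for a single path whose directory contains the sink's
  // metadata log; multiple (or glob) paths cannot have sink metadata.
  def hasMetadata(path: Seq[String]): Boolean = path match {
    case Seq(singlePath) =>
      Files.isDirectory(Paths.get(singlePath, metadataDir))
    case _ => false
  }
}
```

With `hasMetadata` on the companion object, `FileStreamSource` can call it directly instead of receiving a precomputed flag.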
Yea hasMetadata was the reason! Now it lives in object FileStreamSink :-D
nit: you can merge the latest master and use test directly. No need to use testWithUninterruptibleThread after #16947
done; thanks! And good job on #16947!
nit: same as above
done
b66d2cc to d31cb76
Test build #73492 has finished for PR 16987 at commit

Rebased to master and tests updated. @zsxwing would you take another look when you've got a minute?
/**
 * If the source has a metadata log indicating which files should be read, then we should use it.
 * We figure out whether there exists some metadata log only when the user gives a non-glob path.
 */
Just found one corner case: if the query that writes the files has not yet started, the current folder will contain no files even if it's an output folder of the file sink. I think we should keep calling sourceHasMetadata until the folder is non-empty.
Actually, why not just change sourceHasMetadata to a method? sparkSession.sessionState.newHadoopConf() seems expensive, but we can save it into a field.
ah thanks! I was about to change it to a method which would stop detecting once we know for sure whether to use a MetadataLogFileIndex or an InMemoryFileIndex, and remember this information. Will push an update soon.
and add a dedicated test case of course
 *
 * None means we don't know at the moment
 * Some(true) means we know for sure the source DOES have metadata
 * Some(false) means we know for sure the source DOES NOT have metadata
(Some notes here, since the changes are not trivial.)
Here we're using sourceHasMetadata to indicate whether we know for sure the source has metadata, as stated in the source file comments:
- None means we don't know at the moment
- Some(true) means we know for sure the source DOES have metadata
- Some(false) means we know for sure the source DOES NOT have metadata
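The detection lifecycle described above can be sketched with a toy model. Note the names here (`SourceState`, `resolve`, `listFiles`) are illustrative stand-ins for the real file-system checks in FileStreamSource, not the actual API; only the tri-state transitions are modeled.

```scala
// Toy model of the tri-state detection: `listFiles` stands in for
// listing the source directory, `metadataDir` for FileStreamSink's
// "_spark_metadata" folder.
class SourceState(listFiles: () => Seq[String], metadataDir: String) {
  // None: undecided; Some(_): decided once and for all.
  var sourceHasMetadata: Option[Boolean] = None

  def resolve(): Option[Boolean] = sourceHasMetadata match {
    case decided @ Some(_) => decided // never re-detect once decided
    case None =>
      val entries = listFiles()
      if (entries.isEmpty) {
        None // folder still empty: stay undecided, check again next time
      } else {
        sourceHasMetadata = Some(entries.contains(metadataDir))
        sourceHasMetadata
      }
  }
}
```

This also covers the corner case raised earlier: while the writing query has not started and the folder is empty, `resolve()` keeps returning None rather than locking in a wrong answer.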
// Note if `sourceHasMetadata` holds, then `qualifiedBasePath` is guaranteed to be a
// non-glob path
new MetadataLogFileIndex(sparkSession, qualifiedBasePath)
Then, based on sourceHasMetadata's value, we can choose which FileIndex should be used. As shown below, case None requires the most care.
seems like sourceHasMetadata match { case ... } is more appropriate here
val sources = query.get.logicalPlan.collect {
  case StreamingExecutionRelation(source, _) if source.isInstanceOf[FileStreamSource] =>
    source.asInstanceOf[FileStreamSource]
}
this common logic is extracted out
Test build #73578 has finished for PR 16987 at commit
zsxwing left a comment
Looks good overall. Left some style comments.
/** Execute arbitrary code */
case class Execute(val func: StreamExecution => Any) extends StreamAction {
How about just making this extend AssertOnQuery, to avoid adding a new case clause to testStream, which is already pretty long?
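The suggestion might be realized along these lines; a simplified self-contained sketch, not Spark's actual test framework: StreamExecution is replaced by a plain String, and the real AssertOnQuery constructor differs.

```scala
// Simplified model: Execute is not a new StreamAction subtype that
// would need its own branch in testStream; it just builds an
// AssertOnQuery whose condition runs the side effect and always passes.
trait StreamAction
case class AssertOnQuery(condition: String => Boolean, message: String)
  extends StreamAction

object Execute {
  def apply(func: String => Any): AssertOnQuery =
    AssertOnQuery(q => { func(q); true }, "Execute")
}
```

Because Execute produces an ordinary AssertOnQuery, the testStream dispatch loop needs no changes.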
fixed, thanks!
withSQLConf(SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key -> "3") {
  withTempDirs { case (dir, tmp) =>
    // q1 is a streaming query that reads from memory and writes to text files
    val q1_source = MemoryStream[String]
nit: please don't use _ in a variable name.
fixed
test("read data from outputs of another streaming query") {
  withSQLConf(SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key -> "3") {
    withTempDirs { case (dir, tmp) =>
tmp is not used. Why not just name them as (outputDir, checkpointDir)? Same for other tests.
fixed
val q1_source = MemoryStream[String]
val q1_checkpointDir = new File(dir, "q1_checkpointDir").getCanonicalPath
val q1_outputDir = new File(dir, "q1_outputDir")
assert(q1_outputDir.mkdir()) // prepare the output dir for q2 to read
nit: just put the comment after the statement with 1 space. The current format is hard to maintain in the future because it requires aligning comments. Same for other comments.
understood & fixed
testStream(q2)(
  AssertOnQuery { q2 =>
    val fileSource = getSourcesFromStreamingQuery(q2).head
    fileSource.sourceHasMetadata === None // q1 has not started yet, verify that q2
nit: put the comment above this line. Same for other comments
// q1 has not started yet, verify that q2 doesn't know whether q1 has metadata
fileSource.sourceHasMetadata === None
fixed
fileSource.sourceHasMetadata === Some(true) // q1 has started, verify that q2 knows q1 has
                                            // metadata by now
},
CheckAnswer("keep2"), // answer should be correct
nit: // answer should be correct is obvious. Don't add such comments.
fixed
  // doesn't know whether q1 has metadata
},
Execute { _ =>
  q1 = q1_write.start(q1_outputDir.getCanonicalPath) // start q1 !!!
nit: // start q1 !!! is obvious. Don't add such comments.
fixed
q2ProcessAllAvailable(),
CheckAnswer("keep2", "keep3", "keep4"),

// stop q1 manually
nit: // stop q1 manually is obvious. Don't add such comments.
fixed
  }
}

test("read partitioned data from outputs of another streaming query") {
This test seems unnecessary. It will pass even if the source doesn't use the partition information.
In the long term, we should write the partition information to the file sink log, then we can read it in the file source. However, it's out of scope. If you have time, you can think about it and submit a new PR after this one.
test removed. Let me think about this write-partition-information idea :)
thanks!
allFiles = allFilesUsingInMemoryFileIndex()
if (allFiles.isEmpty) {
  // we still cannot decide
  sourceHasMetadata match {
simply switched to sourceHasMetadata match { case ... case ... case ... }; the actual diff is quite small
Test build #73653 has finished for PR 16987 at commit

LGTM. Thanks! Merging to master.
What changes were proposed in this pull request?

Right now the file source always uses InMemoryFileIndex to scan files from a given path. But when reading the outputs of another streaming query, the file source should use MetadataLogFileIndex to list files from the sink log. This patch adds this support, choosing MetadataLogFileIndex or InMemoryFileIndex as appropriate.

How was this patch tested?

Two newly added tests.