@Yaohua628
Contributor

What changes were proposed in this pull request?

Support for querying the _metadata column with a file-based streaming source was added in #35676.

This PR proposes using transformUp instead of match when pattern matching the dataPlan in the runBatch method of MicroBatchExecution. The existing match is sufficient for FileStreamSource, because FileStreamSource always returns a single LogicalRelation node (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L247).

However, the proposed change makes the logic more robust: we should not rely on the upstream source returning a plan of a particular shape. In addition, it could make _metadata work if someone customizes the getBatch method of FileStreamSource.
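
For illustration, the difference between a top-level match and transformUp can be sketched on a toy plan tree. This is a minimal sketch, not the actual MicroBatchExecution code: Plan, Relation, Project, and the tagged flag are hypothetical stand-ins for Catalyst's LogicalPlan nodes, and the hand-written transformUp only mimics a bottom-up traversal.

```scala
// Toy stand-ins for plan nodes. These names are hypothetical and are NOT Spark's classes.
sealed trait Plan {
  def children: Seq[Plan]
  def withChildren(newChildren: Seq[Plan]): Plan

  // Bottom-up rewrite: rewrite the children first, then try the rule on the node itself.
  def transformUp(rule: PartialFunction[Plan, Plan]): Plan = {
    val rewritten = withChildren(children.map(_.transformUp(rule)))
    rule.applyOrElse(rewritten, (p: Plan) => p)
  }
}

// Leaf node, loosely playing the role of a relation; `tagged` marks that it was rewritten.
final case class Relation(name: String, tagged: Boolean = false) extends Plan {
  def children: Seq[Plan] = Nil
  def withChildren(newChildren: Seq[Plan]): Plan = this
}

// Unary node wrapping a child, e.g. what a customized source might put above the relation.
final case class Project(columns: Seq[String], child: Plan) extends Plan {
  def children: Seq[Plan] = Seq(child)
  def withChildren(newChildren: Seq[Plan]): Plan = copy(child = newChildren.head)
}

object PlanRewriteSketch {
  // Top-level match: the rewrite only happens when the root itself is a Relation.
  def rewriteWithMatch(plan: Plan): Plan = plan match {
    case r: Relation => r.copy(tagged = true)
    case other       => other
  }

  // transformUp: the rewrite reaches every Relation, wherever it sits in the tree.
  def rewriteWithTransformUp(plan: Plan): Plan = plan.transformUp {
    case r: Relation => r.copy(tagged = true)
  }
}
```

With a plan such as Project(Seq("value"), Relation("files")), rewriteWithMatch returns the plan unchanged because the root is not a Relation, while rewriteWithTransformUp rewrites the nested Relation. When the root is a single Relation, both behave the same, which is why the existing match is enough for a source that always returns one relation node.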

Why are the changes needed?

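Robustness: the handling of the batch plan no longer depends on the source returning a single LogicalRelation node.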

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests

Contributor

@HeartSaVioR HeartSaVioR left a comment

+1 Thanks for fixing the missed spot! The change is straightforward, and we can go with existing tests.

@HeartSaVioR changed the title from "[SPARK-39404][SQL][Streaming] Minor fix for querying _metadata in streaming" to "[SPARK-39404][SQL][SS] Minor fix for querying _metadata in streaming" on Jun 8, 2022
@HeartSaVioR changed the title from "[SPARK-39404][SQL][SS] Minor fix for querying _metadata in streaming" to "[SPARK-39404][SS] Minor fix for querying _metadata in streaming" on Jun 8, 2022
@HeartSaVioR
Contributor

Thanks! Merging to master.

@felipepessoto

@HeartSaVioR @Yaohua628, I don't see this commit in the 3.3.1-rc4 branch, even though RC4 includes more recent commits, e.g. 946a960.

I'm wondering if there is any particular reason not to include this fix.

Thanks

@Yaohua628
Contributor Author

Yaohua628 commented Oct 21, 2022

Ah, good catch! I guess we missed merging it to 3.3. I will open a backport PR shortly. cc @HeartSaVioR

Thanks!

Yaohua628 added a commit to Yaohua628/spark that referenced this pull request Oct 22, 2022
Closes apache#36801 from Yaohua628/spark-39404.

Authored-by: yaohua <yaohua.zhao@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
HeartSaVioR pushed a commit that referenced this pull request Oct 22, 2022
(This cherry-picks #36801)

Closes #38337 from Yaohua628/spark-39404-3-3.

Authored-by: yaohua <yaohua.zhao@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>