@Yaohua628 Yaohua628 commented Oct 22, 2022

What changes were proposed in this pull request?

(This cherry-picks #36801)

Support for querying the `_metadata` column from a file-based streaming source was added in #35676.

This PR proposes using `transformUp` instead of a plain `match` when pattern matching the `dataPlan` in the `runBatch` method of `MicroBatchExecution`. The existing `match` is fine for `FileStreamSource`, because `FileStreamSource` always returns a single `LogicalRelation` node (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L247).

However, the proposed change makes the logic more robust: we should not rely on the upstream source to return a plan of a particular shape. It could also make `_metadata` work if someone customizes `getBatch` in `FileStreamSource`.
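
To illustrate the difference between the two approaches, here is a minimal sketch on a toy plan tree. It is not the code changed by this PR and does not use Spark's classes: `Plan`, `Relation`, `Project`, the `withMetadata` flag, and `RewriteDemo` are all hypothetical names invented for the example. It only shows why a `match` on the root misses a relation wrapped in another node, while a bottom-up transform still reaches it.

```scala
// Toy stand-in for a logical plan tree. These are NOT Spark's classes.
sealed trait Plan {
  // Apply `rule` bottom-up: rewrite the children first, then the node itself.
  def transformUp(rule: PartialFunction[Plan, Plan]): Plan = {
    val withNewChildren: Plan = this match {
      case Project(columns, child) => Project(columns, child.transformUp(rule))
      case leaf => leaf
    }
    if (rule.isDefinedAt(withNewChildren)) rule(withNewChildren) else withNewChildren
  }
}

// Hypothetical leaf node; `withMetadata` marks that the rewrite was applied.
case class Relation(name: String, withMetadata: Boolean = false) extends Plan

// Hypothetical unary node wrapping another plan.
case class Project(columns: Seq[String], child: Plan) extends Plan

object RewriteDemo {
  // Rewrites only when the root of the plan is itself a Relation.
  def rewriteWithMatch(plan: Plan): Plan = plan match {
    case r: Relation => r.copy(withMetadata = true)
    case other => other
  }

  // Rewrites every Relation, wherever it sits in the tree.
  def rewriteWithTransformUp(plan: Plan): Plan = plan.transformUp {
    case r: Relation => r.copy(withMetadata = true)
  }
}
```

With a bare `Relation` as the root, both functions produce the same result. Once the relation is wrapped, for example in a `Project`, only `rewriteWithTransformUp` still applies the rewrite, which is the situation a customized `getBatch` could create.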

Why are the changes needed?

Robustness: the rewrite no longer depends on the source returning a plan of a particular shape.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests

Closes apache#36801 from Yaohua628/spark-39404.

Authored-by: yaohua <yaohua.zhao@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
@Yaohua628
Contributor Author

cc: @HeartSaVioR @felipepessoto

@HeartSaVioR
Contributor

https://github.com/Yaohua628/spark/runs/9040960126

The build passed; the result does not appear to be reflected on this PR.

@HeartSaVioR HeartSaVioR left a comment

+1

@HeartSaVioR
Contributor

Thanks! Merging to 3.3.

HeartSaVioR pushed a commit that referenced this pull request Oct 22, 2022
Closes #38337 from Yaohua628/spark-39404-3-3.

Authored-by: yaohua <yaohua.zhao@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
@AmplabJenkins

Can one of the admins verify this patch?
