@LiangchangZ LiangchangZ commented Apr 25, 2019

What changes were proposed in this pull request?

window($"fooTime", "2 seconds").alias("fooWindow") generates the expression tree Alias(fooWindow) <- TimeWindow. After the TimeWindowing rule is applied, the tree becomes Alias(fooWindow) <- Alias(window) <- Window(start, end). The Alias(window) receives the watermark metadata when it is created:

val windowStruct = Alias(getWindow(0, 1), WINDOW_COL_NAME)(
  exprId = windowAttr.exprId, explicitMetadata = Some(metadata))

However, Alias(fooWindow) is created before the TimeWindowing rule takes effect. Its code path is:

...
case ne: NamedExpression => Alias(expr, alias)(explicitMetadata = Some(ne.metadata))
...

Before the TimeWindowing rule takes effect, ne.metadata is still empty, so the watermark metadata is lost.

We make def name(alias: String) return an Alias that takes its metadata from its child automatically when no metadata is specified explicitly.
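
The difference between the two strategies can be sketched with a small standalone model. This is not Spark code: Expr, TimeWindowCall, WindowStruct, Alias, resolveWindow and the DelayKey string are hypothetical stand-ins for the Catalyst classes and the TimeWindowing rule, assumed here only for illustration.

```scala
// Simplified model of the problem; none of these types are Spark's own classes.
sealed trait Expr { def metadata: Map[String, String] }

// Stand-in for the unresolved time-window call: it carries no metadata yet.
case class TimeWindowCall(timeColumn: String) extends Expr {
  val metadata: Map[String, String] = Map.empty
}

// Stand-in for the resolved window struct, which is tagged with watermark metadata.
case class WindowStruct(metadata: Map[String, String]) extends Expr

// An alias either pins an explicit metadata snapshot or defers to its child.
case class Alias(
    child: Expr,
    name: String,
    explicitMetadata: Option[Map[String, String]] = None) extends Expr {
  def metadata: Map[String, String] = explicitMetadata.getOrElse(child.metadata)
}

object AliasMetadataDemo {
  // Hypothetical metadata key used only in this sketch.
  val DelayKey = "watermarkDelayMs"

  // Stand-in for the analyzer rule: rewrites the window call underneath any aliases.
  def resolveWindow(e: Expr): Expr = e match {
    case a @ Alias(child, _, _) => a.copy(child = resolveWindow(child))
    case TimeWindowCall(_)      => WindowStruct(Map(DelayKey -> "10000"))
    case other                  => other
  }

  def main(args: Array[String]): Unit = {
    val inner = Alias(TimeWindowCall("fooTime"), "window")

    // Old behaviour: the outer alias snapshots the (still empty) metadata up front.
    val snapshot = Alias(inner, "fooWindow", Some(inner.metadata))
    // Proposed behaviour: no explicit metadata, so it is read from the child on demand.
    val deferred = Alias(inner, "fooWindow")

    println(resolveWindow(snapshot).metadata.contains(DelayKey)) // false
    println(resolveWindow(deferred).metadata.contains(DelayKey)) // true
  }
}
```

In this model the snapshot is taken before the rewrite, so it stays empty however the child changes later, while the deferring alias picks up whatever metadata the rewritten child carries.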

Thanks to @LinhongLiu for helping analyze this problem!

How was this patch tested?

Added a UT and ran the integration test: the example in the JIRA now runs successfully and no longer throws org.apache.spark.sql.AnalysisException.

…ne explicitMetadata, so the metadata of the Alias object is taken directly from its child

Change-Id: Ia2246b05688461ad907f1e16c96e7282f655d5a6
Change-Id: Ie04be37fd6a4859b3b0da0c7c07d54558cb23758
@LiangchangZ LiangchangZ changed the title [SPARK-27340] Alias on TimeWindow expression may cause watermark metadata lost [SPARK-27340][Streaming] Alias on TimeWindow expression may cause watermark metadata lost Apr 25, 2019
Change-Id: If8f4ba6a93f4a37ffb096b95c62dce8f87db3233
@LiangchangZ
Author

cc @xuanyuanking

-      case ne: NamedExpression => Alias(expr, alias)(explicitMetadata = Some(ne.metadata))
-      case other => Alias(other, alias)()
-    }
+    Alias(expr, alias)()
Member

This function was introduced in #11908. The change here will also affect def alias(alias: String): Column and def as(alias: String): Column. Did you check all the related test cases?

Author

Yes, I have successfully run all the relevant UTs.
(Sorry for the late reply; we recently had a long holiday.)

Member

The related tests may not be enough. With your change, the as function behaves differently: the child's metadata is no longer passed as explicitMetadata. My suggestions:

  1. For safety, run all tests, not only the related ones.
  2. Apply the fix only to the SS scenario.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-27340][Streaming] Alias on TimeWindow expression may cause watermark metadata lost [SPARK-27340][SS] Alias on TimeWindow expression may cause watermark metadata lost Jul 28, 2019
@dongjoon-hyun
Member

ok to test

@SparkQA

SparkQA commented Jul 29, 2019

Test build #108285 has finished for PR 24457 at commit a0508c9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Dec 30, 2019
@github-actions github-actions bot closed this Dec 31, 2019
@dongjoon-hyun dongjoon-hyun reopened this Jan 6, 2020
.select(window($"eventTime", "5 seconds") as 'aliasWindow)

assert(aliasWindow.logicalPlan.output.exists(
  _.metadata.contains(EventTimeWatermark.delayKey)))
Member

Since this test case still fails on the master branch (as of today), the issue seems to still exist.

@dongjoon-hyun
Member

While preparing the 2.4.5 release, I noticed that this was closed recently, and we might need to fix the underlying issue. The test case fails on both master and branch-2.4.

If watermarks are ignored, the internal state grows indefinitely. What do you think about the reported issue, @tdas, @zsxwing, @cloud-fan, @HeartSaVioR, @gatorsmile?
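
The state-growth concern can be illustrated with a toy model of windowed counting. This is only a sketch under simplifying assumptions, not how Spark's state store works: Event, openWindows and the per-event eviction loop are hypothetical, events arrive in order, and a window is dropped once its end is at or before the watermark (maximum event time seen minus the delay).

```scala
object WatermarkStateDemo {
  // Hypothetical event: an event time in seconds plus a grouping key.
  final case class Event(eventTimeSec: Long, key: String)

  /** Counts events per tumbling window and returns how many windows remain in state.
    * With delaySec = None there is no watermark, so nothing is ever evicted.
    */
  def openWindows(events: Seq[Event], windowSec: Long, delaySec: Option[Long]): Int = {
    var state = Map.empty[(Long, String), Long]
    var maxEventTime = Long.MinValue
    for (e <- events) {
      // Assign the event to its tumbling window and update the count.
      val windowStart = e.eventTimeSec - (e.eventTimeSec % windowSec)
      val k = (windowStart, e.key)
      state = state.updated(k, state.getOrElse(k, 0L) + 1L)
      maxEventTime = math.max(maxEventTime, e.eventTimeSec)
      // Evict windows that ended at or before the watermark, if there is one.
      delaySec.foreach { d =>
        val watermark = maxEventTime - d
        state = state.filter { case ((start, _), _) => start + windowSec > watermark }
      }
    }
    state.size
  }

  def main(args: Array[String]): Unit = {
    val events = (0L until 100L).map(t => Event(t, "k"))
    println(openWindows(events, 5L, None))      // 20: every window is kept
    println(openWindows(events, 5L, Some(10L))) // 3: only recent windows remain
  }
}
```

Without a watermark the number of retained windows grows with the length of the stream, whereas with a 10-second delay it stays bounded by the window size and the delay.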

@github-actions github-actions bot closed this Jan 7, 2020
@HeartSaVioR
Contributor

Looking at the test code, the issue seems to be valid and the PR fixes it correctly. But I'm not sure about the side effects, as @xuanyuanking commented.

By the way, I think this has been a known issue, and the underlying issue may not be just the missing metadata copy. I'm not sure Spark can ensure that metadata is propagated correctly across multiple transformations, including typed -> untyped and vice versa. It doesn't seem to be something we can rely on.

I think the root issue is that the event time column and its value are open to modification. Other streaming frameworks provide a way to specify the event time per row, and the value is treated as a special column that cannot be modified (neither the column nor the value) during transformations.

I've had a long discussion with @echauchot (who works on the Spark runner in Beam) regarding this. Please follow the link: #23576 (comment)

@HeartSaVioR
Contributor

Oh, the bot closed the PR again... Looks like we should also remove the Stale tag when reopening.

@dongjoon-hyun
Member

dongjoon-hyun commented Jan 7, 2020

Thank you for the feedback, @HeartSaVioR. I agree with you and was also surprised by this. With watermark bugs, Apache Spark structured streaming is not usable at all in case of the state operations.

For auto-close, @nchammas , what is the correct reopening process?
(also cc @srowen )

@AmplabJenkins

Can one of the admins verify this patch?

@nchammas
Contributor

nchammas commented Jan 7, 2020

I think removing the Stale tag should do the trick. If not, I can investigate. We should perhaps update the close message to direct people accordingly.

Can PR authors remove the tag themselves, by the way, or does that require a committer?

@LinhongLiu
Contributor

@dongjoon-hyun @HeartSaVioR @xuanyuanking
Hi, the author of this patch, @LiangchangZ, currently has no time to continue it. If it still needs to be fixed, I can carry on with the remaining work.

@dongjoon-hyun
Member

Thank you, @nchammas . Tagging requires committership.

Thank you, @LinhongLiu. Yes, we can do that while keeping @LiangchangZ's authorship. This PR is currently at the stage where we are discussing the validity and impact of this bug. I believe it should be considered Critical because it makes a core feature unavailable. Let's wait and see other people's comments.

@xuanyuanking
Member

@LinhongLiu Thanks, Linhong. I agree that this bugfix should continue; please go ahead.

@echauchot

echauchot commented Jan 7, 2020

With watermark bugs, Apache Spark structured streaming is not usable at all in case of the state operations.

Indeed, we (Beam) are stuck on the streaming-mode implementation of the translation layer (the Beam runner) that uses the Structured Streaming framework. What about reopening #23576 as well? CC @arunmahadevan

@srowen
Member

srowen commented Jan 7, 2020

This was already open. I think the Stale tag just had to be removed, which was done yesterday.

srowen pushed a commit that referenced this pull request Jan 7, 2020
Follow-on to #26877.

### What changes were proposed in this pull request?

This PR tweaks the stale PR message to [clarify](#24457 (comment)) the procedure for reopening a PR after it has been marked as stale.

### Why are the changes needed?

This change should clarify the reopening process for contributors.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A

Closes #27114 from nchammas/SPARK-30173-stale-tweaks.

Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Apr 17, 2020
@xuanyuanking
Member

xuanyuanking commented Apr 17, 2020

@LinhongLiu Are you still working on this? If not, I will take over and ping you for review.

@github-actions github-actions bot closed this Apr 18, 2020
@dongjoon-hyun
Member

Please take over this, @xuanyuanking . Thanks~

@xuanyuanking
Member

Thanks, I will submit a new PR today.

dongjoon-hyun pushed a commit that referenced this pull request Apr 27, 2020
…data lost

Credit to LiangchangZ: this PR reuses the UT as well as the integration test in #24457. Thanks, Liangchang, for your solid work.

### What changes were proposed in this pull request?
Make metadata propagatable between Aliases.

### Why are the changes needed?
In Structured Streaming, we add an Alias for TimeWindow by default.
https://github.com/apache/spark/blob/590b9a0132b68d9523e663997def957b2e46dfb1/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3272-L3273
In some cases, such as a stream join with a watermark and a window, users need to add an alias for convenience (we also added one in StreamingJoinSuite). The current metadata handling logic for `as` loses the watermark metadata
https://github.com/apache/spark/blob/590b9a0132b68d9523e663997def957b2e46dfb1/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L1049-L1054
and finally causes the AnalysisException:
```
Stream-stream outer join between two streaming DataFrame/Datasets is not supported without a watermark in the join keys, or a watermark on the nullable side and an appropriate range condition
```

### Does this PR introduce any user-facing change?
Bugfix for an alias on time window with watermark.

### How was this patch tested?
New UTs added. One for the functionality and one for explaining the common scenario.

Closes #28326 from xuanyuanking/SPARK-27340.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request Apr 27, 2020
(cherry picked from commit ba7adc4)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
rshkv pushed a commit to palantir/spark that referenced this pull request Jan 28, 2021