[SPARK-17154][SQL] Wrong result can be returned or AnalysisException can be thrown after self-join or similar operations #14719
b3e887c to
05872b7
Compare
|
Test build #64081 has finished for PR 14719 at commit
|
|
Test build #64082 has finished for PR 14719 at commit
|
|
See a related PR by @cloud-fan : #11632 |
|
Test build #64116 has finished for PR 14719 at commit
|
|
Test build #64126 has finished for PR 14719 at commit
|
|
@gatorsmile Thanks for the information. I'll check it. |
|
retest this please. |
|
Test build #64137 has finished for PR 14719 at commit
|
|
Test build #64145 has finished for PR 14719 at commit
|
|
Test build #64144 has finished for PR 14719 at commit
|
|
Test build #64146 has finished for PR 14719 at commit
|
|
Test build #64148 has finished for PR 14719 at commit
|
|
Test build #64155 has finished for PR 14719 at commit
|
|
It's really a hard problem, and we have discussed it many times without reaching a consensus. Do you mind sending a design doc first so that it's easier for other people to review and discuss? Thanks! |
|
@cloud-fan Of course. I'll write a design doc soon. |
|
Test build #64538 has finished for PR 14719 at commit
|
|
Test build #64537 has finished for PR 14719 at commit
|
|
In the current commit (b778b5d), I tried prohibiting direct self-joins. |
|
Test build #66034 has finished for PR 14719 at commit
|
b778b5d to
437ac99
Compare
|
I noticed So, on second thought, I decided not to prohibit self-joins. |
|
@sarutak, on the surface the problem looks like it is in the Optimization code, but in fact the root cause is that the column/ExprId C2#77 from T2 is indistinguishable between the two streams referencing the relation T2, one in the right table of the LEFT JOIN and the other in the IN subquery. This further makes the Optimization rule My comments in SPARK-17337 on 31/Aug/16 14:42 and 14:43 explain in more detail. |
|
The test case @sarutak raised here is what I consider the problem with the current code.
How do we conclude that the left operand df("key") references the first relation df and the right operand the second relation df? We can't. Arguably, we can treat this predicate as a local predicate in which both operands reference one of the two relations; that is, it could mean any of the three cases below:
Hence this type of statement should yield an ambiguous reference error. |
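To make the ambiguity concrete, here is a minimal Python sketch (not Spark code) that enumerates how an equality predicate over a single column name can bind to the two sides of a self-join. The helper name `readings` and the side labels are hypothetical, and the three readings it lists are an assumption about which cases are meant above.

```python
from itertools import combinations_with_replacement

def readings(column, sides=("left", "right")):
    """List the distinct ways `df(column) === df(column)` can bind to join sides.

    Equality is symmetric, so (left, right) and (right, left) count once.
    """
    return [
        f"{a}.{column} = {b}.{column}"
        for a, b in combinations_with_replacement(sides, 2)
    ]

cases = readings("key")
for case in cases:
    print(case)
```

Only one of these readings is a real join predicate; the other two are local predicates on a single side, which is why nothing in the expression itself tells the analyzer which one was intended.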
|
Test build #66040 has finished for PR 14719 at commit
|
|
Test build #66747 has finished for PR 14719 at commit
|
|
Test build #67692 has finished for PR 14719 at commit
|
|
Test build #67693 has finished for PR 14719 at commit
|
|
@cloud-fan, I was studying the ResolveSubquery code for my work on SPARK-17348. I was at first puzzled by the code in It was not until I debugged a SQL query that referenced the same table in both the outer table and the table in the subquery that I realized I had run into an issue similar to the one we are trying to fix here. I think my proposal of generating a new ExprId for each column will make this piece of code unnecessary. |
|
Test build #68750 has finished for PR 14719 at commit
|
|
Test build #68753 has started for PR 14719 at commit |
|
retest this please. |
|
Test build #68760 has finished for PR 14719 at commit
|
|
Test build #70911 has finished for PR 14719 at commit
|
|
@sarutak would your code be able to solve this ambiguity of df("a") in the join condition? Here is my understanding. At the first function
Spark implicitly creates a new Dataset. So when it tries to resolve the column Taking the above example, we can draw a tree resembling the embedded structure of the Dataset where
When we try to resolve Your breadth-first search walk may hit the correct An interesting test scenario to verify this would be the one below:
|
|
Hi @sarutak, what do you think about ^? |
|
@HyukjinKwon Thanks for pinging me! I still think this issue should be fixed but I didn't notice @nsyca's last comment. I'll consider the problem which he mentioned soon. |
|
ping @sarutak |
|
I found that this solution can't resolve the issue in some corner cases. I'll close this PR for now and revise it later.
What changes were proposed in this pull request?
When we join two DataFrames that originate from the same DataFrame, operations on the joined DataFrame can fail.
One reproducible example is as follows.
In this case, AnalysisException is thrown.
Another example is as follows.
In this case, we expect to get the following answer.
But the actual result is as follows.
The cause of the problems in the examples is that the logical plan for the right-side DataFrame and the expressions in its output are re-created in the analyzer (by the ResolveReferences rule) when a DataFrame has expressions that share the same exprId.
The re-created expressions are equal to the original ones except for their exprId.
This happens when we do a self-join or a similar operation.
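The mechanism can be illustrated with a small toy model in Python. This is only a sketch under stated assumptions: `Attr`, `Frame`, `fresh`, `join`, and `positions_by_id` are hypothetical names invented for the illustration, not Spark APIs, and the model only mimics the idea that right-side attributes receive new ids when they collide with the left side.

```python
import itertools
from dataclasses import dataclass

_next_id = itertools.count(1)

@dataclass(frozen=True)
class Attr:
    name: str
    expr_id: int

def fresh(name):
    """Create an attribute with a new, unique id."""
    return Attr(name, next(_next_id))

class Frame:
    """Toy stand-in for a DataFrame: just its output attributes."""
    def __init__(self, output):
        self.output = list(output)

    def __call__(self, name):
        # Like df("col"): hand out the attribute (and its id) as of now.
        (attr,) = [a for a in self.output if a.name == name]
        return attr

def join(left, right):
    # Assumption: on any id collision, the whole right side is re-created.
    left_ids = {a.expr_id for a in left.output}
    if any(a.expr_id in left_ids for a in right.output):
        right = Frame(fresh(a.name) for a in right.output)
    return Frame(left.output + right.output)

def positions_by_id(frame, attr):
    """Output positions in `frame` whose id matches a previously captured attr."""
    return [i for i, a in enumerate(frame.output) if a.expr_id == attr.expr_id]

df = Frame(fresh(n) for n in ("col1", "col2", "col3"))
filtered = Frame(df.output[:2])   # shares the ids of col1 and col2 with df
joined = join(filtered, df)

missing = positions_by_id(joined, df("col3"))     # stale id matches nothing
wrong_side = positions_by_id(joined, df("col1"))  # stale id matches the left side
```

In this model, `missing` is empty because the id captured by `df("col3")` no longer exists anywhere in the joined output, while `wrong_side` points at position 0, an attribute contributed by the left side rather than by `df`.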
In the first example, df("col3") returns a Column that contains an expression, and that expression has an exprId (say id1 here).
After the join, the expression that the right-side DataFrame (df) has is re-created; the old and new expressions are equal except that the exprId is renewed (say id2 for the new exprId here).
Because of the mismatch between those exprIds, AnalysisException is thrown.
In the second example, df("col1") returns a Column, and the expression contained in it is assigned an exprId (say id3).
On the other hand, the Column returned by filtered("col1") has an expression with the same exprId (id3).
After the join, the expressions in the right-side DataFrame are re-created, so the expression assigned id3 is no longer present on the right side but is still present on the left side.
So, when we refer to df("col1") on the joined DataFrame, we get col1 of the right side, which includes null.
To resolve this issue, I have introduced LazilyDeterminedAttribute.
It is returned when we refer to a column like df("expr"), and it lazily determines which expression df("expr") should point to.
How was this patch tested?
I added some test cases.
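As a rough illustration of the lazy-resolution idea in the description, the following Python sketch defers binding until the reference is used against a concrete joined frame. It is not the patch's implementation: `LazyRef`, `Frame`, `select`, `join`, and `resolve` are hypothetical names, and tracking the originating frame per output slot is an assumption made for the sketch.

```python
import itertools
from dataclasses import dataclass

_ids = itertools.count(1)

@dataclass(frozen=True)
class Attr:
    name: str
    expr_id: int

@dataclass(frozen=True)
class LazyRef:
    """Remembers which frame and which name were asked for, not an id."""
    source: "Frame"
    name: str

class Frame:
    def __init__(self, names=()):
        # Each slot pairs the frame the column was referenced through with its attribute.
        self.slots = [(self, Attr(n, next(_ids))) for n in names]

    def __call__(self, name):
        return LazyRef(self, name)

    def select(self, *names):
        child = Frame()
        child.slots = [(child, a) for _, a in self.slots if a.name in names]
        return child

    def join(self, other):
        # Simplification: right-side attributes always get fresh ids,
        # but the record of which frame they came from is preserved.
        right = [(origin, Attr(a.name, next(_ids))) for origin, a in other.slots]
        joined = Frame()
        joined.slots = self.slots + right
        return joined

    def resolve(self, ref):
        hits = [a for origin, a in self.slots
                if origin is ref.source and a.name == ref.name]
        if len(hits) != 1:
            raise LookupError(f"{ref.name}: {len(hits)} candidates")
        return hits[0]

df = Frame(["col1", "col2", "col3"])
filtered = df.select("col1", "col2")
joined = filtered.join(df)
right_col1 = joined.resolve(df("col1"))       # binds to the right side
left_col1 = joined.resolve(filtered("col1"))  # binds to the left side
```

Note that a direct `df.join(df)` is still ambiguous in this sketch (`resolve` raises `LookupError`), which is in line with the ambiguity concern raised in the conversation and the corner cases mentioned when the PR was closed.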