Skip to content

Conversation

@zhengruifeng
Copy link
Contributor

What changes were proposed in this pull request?

Fix unexpected AnalysisException from Spark Connect client

Why are the changes needed?

to make PS work on Connect

Does this PR introduce any user-facing change?

yes

How was this patch tested?

manually check

@zhengruifeng
Copy link
Contributor Author

In #39925, we introduced a new mechanism to resolve expression with specified plan.

However, sometimes the plan ID might be eliminated by the analyzer, and then some expressions can not be correctly resolved, this issue is the No.1 blocker of PS on Connect.

Currently, I investigate the two examples in the ticket and check each rule applied to them.

example 1:

>>> import pyspark.pandas as ps
>>> psdf1 = ps.DataFrame({"A": [1, 2, 3]})
>>> psdf2 = ps.DataFrame({"B": [1, 2, 3]})
>>> psdf1.append(psdf2)

example 2:

import pyspark.pandas as ps
import pandas as pd

pdf = pd.DataFrame({"A": [None, 3, None, None], "B": [2, 4, None, 3], "C": [None, None, None, 1], "D": [0, 1, 5, 4],}, columns=["A", "B", "C", "D"],)
psdf = ps.from_pandas(pdf)
psdf.backfill()

In the draft, I modify two rules to retain the plan id. (actually, I modified ResolveNaturalAndUsingJoin in 167bbca)

I am wondering whether is there some graceful approach to fix this issue? Otherwise, I'm afraid I will touch more rules.

cc @cloud-fan @HyukjinKwon @itholic

val newProject = Project(finalProjectList, withWindow)

// retain the plan id used in Spark Connect
planId.foreach(newProject.setTagValue(LogicalPlan.PLAN_ID_TAG, _))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is to fix example 2.

}

// retain the plan id used in Spark Connect
planId.foreach(newPlan.setTagValue(LogicalPlan.PLAN_ID_TAG, _))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is to fix example 1

@zhengruifeng
Copy link
Contributor Author

not sure whether this works:

  • for each analysis rule, if the plan is changed, check whether the output plan (as well as its descendents) contains all the planIDs from the input plan (as well as its descendents), if some planID are missing, attach the missing ones to the output plan;
  • we can still modify some rule if need;

@itholic
Copy link
Contributor

itholic commented Jul 17, 2023

Thanks for your time on investigating this issue! 👍

Since this issue is also blocking numerous test cases other than the two examples provided in the ticket, I have created a separate temporary test PR here: #42041 that cherry-picks this fix and enables all tests related to this issue to check if the fix is can be applied to all other test cases as well.

P.S. If the cause of failure for every test is different and difficult to fix in one PR, we may need to break this issue into multiple sub-tasks.

@zhengruifeng
Copy link
Contributor Author

close this one in favor of #42086

@zhengruifeng zhengruifeng deleted the ps_analysis branch July 27, 2023 03:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants