[SPARK-3634] [PySpark] User's module should take precedence over system modules #2492

davies · 2014-09-22T19:04:39Z

Python modules added through addPyFile should take precedence over system modules.

This patch puts the paths of user-added modules at the front of sys.path (just after '').
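For illustration only, a minimal sketch of that idea (not the code in this patch). The helper name `add_user_path` and the example paths are hypothetical, and the sketch assumes the first entry of the search path is '' as described above.

```python
import sys


def add_user_path(path, search_path=None):
    """Insert `path` right after the first entry of the search path.

    Hypothetical helper for illustration; it is not part of PySpark.
    """
    search_path = sys.path if search_path is None else search_path
    if path not in search_path:
        # Index 0 is assumed to be '' (the current directory), so index 1
        # places the user's path ahead of every other entry.
        search_path.insert(1, path)
    return search_path


# Work on a made-up list so the interpreter's real sys.path is untouched.
demo = ['', '/usr/lib/python2.7', '/usr/lib/python2.7/site-packages']
add_user_path('/tmp/spark-files/mylib.zip', demo)
print(demo)
```

Calling the helper without a second argument would modify the real sys.path instead of the demo list.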

SparkQA · 2014-09-22T19:09:51Z

QA tests have started for PR 2492 at commit c16c392.

This patch merges cleanly.

SparkQA · 2014-09-22T19:24:22Z

QA tests have started for PR 2492 at commit 6b0002f.

This patch merges cleanly.

SparkQA · 2014-09-22T19:24:33Z

QA tests have finished for PR 2492 at commit 6b0002f.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2014-09-22T20:26:08Z

QA tests have started for PR 2492 at commit 6b0002f.

This patch merges cleanly.

SparkQA · 2014-09-22T20:29:49Z

QA tests have started for PR 2492 at commit f7ff4da.

This patch merges cleanly.

JoshRosen · 2014-09-22T20:45:48Z

BTW: it's a bit dangerous that a user can upload a new module that modifies the default behavior of the system. Currently, it's hard to find the correct position to insert the user's module.

Maybe my JIRA was misleadingly named; my motivation here is allowing users to specify versions of packages that take precedence over other versions of that same package that might be installed on the system, not in overriding modules included in Python's standard library (although the ability to do that is a side-effect of this change).

SparkQA · 2014-09-22T21:09:51Z

Tests timed out after a configured wait of 120m.

SparkQA · 2014-09-22T21:21:24Z

QA tests have finished for PR 2492 at commit 6b0002f.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2014-09-22T21:25:20Z

QA tests have finished for PR 2492 at commit f7ff4da.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2014-09-22T21:32:57Z

QA tests have started for PR 2492 at commit f7ff4da.

This patch merges cleanly.

SparkQA · 2014-09-22T21:34:21Z

QA tests have started for PR 2492 at commit 4a2af78.

This patch merges cleanly.

SparkQA · 2014-09-22T22:39:10Z

QA tests have finished for PR 2492 at commit f7ff4da.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging

SparkQA · 2014-09-22T22:41:15Z

QA tests have finished for PR 2492 at commit 4a2af78.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2014-09-22T22:41:18Z

Merged build finished. Test PASSed.

SparkQA · 2014-09-22T22:41:19Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20668/

davies · 2014-09-23T00:28:47Z

Maybe my JIRA was misleadingly named; my motivation here is allowing users to specify versions of packages that take precedence over other versions of that same package that might be installed on the system, not in overriding modules included in Python's standard library (although the ability to do that is a side-effect of this change).

Understood, this side effect is a bit dangerous. Third-party packages can appear in sys.path in any order, such as:

>>> import sys
>>> sys.path
['', '//anaconda/lib/python2.7/site-packages/DPark-0.1-py2.7.egg', '//anaconda/lib/python2.7/site-packages/protobuf-2.5.0-py2.7.egg', '//anaconda/lib/python2.7/site-packages/msgpack_python-0.4.2-py2.7-macosx-10.5-x86_64.egg', '//anaconda/lib/python2.7/site-packages/setuptools-3.6-py2.7.egg', '/Users/daviesliu/work/spark/python/lib', '/Users/daviesliu/work/spark/python/lib/py4j-0.8.2.1-src.zip', '/Users/daviesliu/work/spark/python', '//anaconda/lib/python27.zip', '//anaconda/lib/python2.7', '//anaconda/lib/python2.7/plat-darwin', '//anaconda/lib/python2.7/plat-mac', '//anaconda/lib/python2.7/plat-mac/lib-scriptpackages', '//anaconda/lib/python2.7/lib-tk', '//anaconda/lib/python2.7/lib-old', '//anaconda/lib/python2.7/lib-dynload', '//anaconda/lib/python2.7/site-packages', '//anaconda/lib/python2.7/site-packages/PIL', '//anaconda/lib/python2.7/site-packages/runipy-0.1.0-py2.7.egg']

It's not easy to find a position that is before the third-party packages but after the standard library modules.
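To make the ordering concern concrete, here is a small sketch showing that the first matching entry in sys.path wins. The module name `spark3634_demo_pkg` and the two temporary directories are invented for the demo; the "system" directory merely stands in for a site-packages entry.

```python
import importlib
import os
import sys
import tempfile


def write_module(directory, name, version):
    """Write a one-line module `name`.py that records which copy it is."""
    with open(os.path.join(directory, name + ".py"), "w") as f:
        f.write("VERSION = %r\n" % version)


# Hypothetical module name, chosen so it cannot clash with a real package.
name = "spark3634_demo_pkg"
system_dir = tempfile.mkdtemp()
user_dir = tempfile.mkdtemp()
write_module(system_dir, name, "system")
write_module(user_dir, name, "user")

# Appending the user's directory leaves the "system" copy first in line.
sys.path.insert(1, system_dir)
sys.path.append(user_dir)
importlib.invalidate_caches()
appended = importlib.import_module(name).VERSION

# Inserting the user's directory near the front lets the user's copy win.
sys.path.remove(user_dir)
sys.path.insert(1, user_dir)
del sys.modules[name]
importlib.invalidate_caches()
inserted = importlib.import_module(name).VERSION

print(appended, inserted)  # prints: system user
```

The same mechanism is what allows a user-supplied file to shadow a standard library module of the same name, which is the risk discussed here.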

JoshRosen · 2014-09-23T05:25:53Z

Understood, this side effect is a bit dangerous. Third-party packages can appear in sys.path in any order

Are you worried about a user adding a Python module whose name conflicts with a built-in module, thereby shadowing it? I think this is a general Python problem that can occur even without sys.path manipulation, which is why it's bad to have top-level modules that have the same name as built-in ones (and also why relative imports can be bad): http://www.evanjones.ca/python-name-clashes.html

davies · 2014-09-23T22:14:30Z

I think it's fine to move on and remove the comment about risk from the PR's description.

mattf · 2014-09-24T12:00:15Z

this is a nice addition. re danger, i'll add that the user is only able to endanger herself.

+1 lgtm

JoshRosen · 2014-09-24T18:39:05Z

python/pyspark/context.py

-                sys.path.append(path)
-                if dirname not in sys.path:
-                    sys.path.append(dirname)
+                if filename.lower().endswith("zip") or filename.lower().endswith("egg"):


I think that spark.submit.pyFiles is allowed to contain .py files, too:

--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps.

Will this new filtering by .zip and .egg prevent this from working?

The .py files will be put in root_dir and can be imported by name, so they should not be put in sys.path. This depends on spark-submit copying the .py file into root_dir locally.

Putting the base directory of a .py file into sys.path would cause other issues if there are other files in the same directory, such as copy.py.

Do we explicitly add root_dir to sys.path? I don't think we can always assume that the Python driver / worker are executed from inside of root_dir.

Aha, I see that we do add root_dir to the path in worker.py.

root_dir is already added into sys.path, see LINE 174

Ah, great. In that case, this PR looks good to me, so I'm going to merge it. Thanks!
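The outcome of this review thread can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged code: `paths_to_add` is a hypothetical helper, and it assumes each shipped file ends up in root_dir under its base name. The suffix check mirrors the condition in the diff above, which tests for "zip" and "egg" without a leading dot.

```python
import os


def paths_to_add(py_files, root_dir):
    """Return the sys.path entries needed for a list of shipped files.

    Hypothetical helper: `py_files` and `root_dir` stand in for the values
    PySpark derives from spark.submit.pyFiles and its local files directory.
    """
    entries = []
    for path in py_files:
        filename = os.path.basename(path)
        # Archives need their own entry so Python can import from inside them.
        if filename.lower().endswith(("zip", "egg")):
            entries.append(os.path.join(root_dir, filename))
        # Plain .py files live directly in root_dir, which is assumed to be
        # on sys.path already, so they need no entry of their own.
    return entries


result = paths_to_add(["deps.zip", "lib/helper.egg", "job.py"], "/tmp/spark-files")
print(result)
```

Skipping plain .py files avoids adding their parent directory to sys.path, which is how an unrelated neighbour such as copy.py could otherwise become importable.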

put addPyFile in front of sys.path

c16c392

add tests

6b0002f

ad license header

f7ff4da

fix tests

4a2af78

JoshRosen reviewed Sep 24, 2014
asfgit closed this in c854b9f Sep 24, 2014
