[SPARK-46894][PYTHON] Move PySpark error conditions into standalone JSON file #44920
Conversation
@@ -0,0 +1,1096 @@
{
The problem is that it has to be packaged together and be uploadable to PyPI. It would be great if we can make sure that still works.
Yes, good call out. I was looking at MANIFEST.in, and I believe this file should be included, but I will confirm.
Confirmed. I've updated the PR description accordingly.
# Note that though we call them "error classes" here, the proper name is "error conditions",
# hence why the name of the JSON file is different.
# For more information, please see: https://issues.apache.org/jira/browse/SPARK-46810
ERROR_CONDITIONS_PATH = THIS_DIR / "error-conditions.json"
This comment is related to the work being done in #44902, by the way.
@itholic can you take a look please? I remember you took a look and failed to find a good way to upload the JSON.
IIRC there was no major issue with managing the existing error definitions, so I agree to change to a JSON file.
This is for a separate potential PR, but if it were possible to use the "main" error JSON files from Scala in PySpark automatically, would we want to do that? I see that PySpark's errors don't define a SQLSTATE, so I assumed they were a separate thing and we didn't want to reuse the main error definitions.
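To illustrate the structural difference being referred to: entries in the JVM-side error file carry a sqlState field, while PySpark's entries are message templates only. The entries below are hypothetical examples for illustration, not copied from either file.

```python
# Hypothetical shape of a JVM-side entry in the main error file (note the SQLSTATE):
jvm_side_entry = {
    "SOME_ARITHMETIC_ERROR": {
        "message": ["Something went wrong with <operation>."],
        "sqlState": "22012",
    }
}

# Hypothetical shape of a PySpark-side entry (message template only, no SQLSTATE):
pyspark_side_entry = {
    "SOME_UNSUPPORTED_ATTRIBUTE": {
        "message": ["Attribute `<attr_name>` is not supported."],
    }
}
```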
Is this a reference to this command? (spark/python/docs/source/conf.py, line 42 at 8060e7e)
Anyway, just FYI, I am in the middle of revamping the main error documentation to make it easier to maintain. First I am cleaning up the terminology in #44902, but then I will be consolidating the various error docs.
The build is still running, but the pyspark-core tests are passing, I believe.
I don't think so. As I recall, the main reason for not doing it was that, as you said, the error structure on the PySpark side is different from the error structure on the JVM side.
+1
Yes, so you might need to fix the description at https://github.com/apache/spark/blob/master/python/pyspark/errors_doc_gen.py#L44 (errors_doc_gen.py, line 44 at 90e6c0c).
I think we should wait for the conversation in SPARK-46810 to resolve before merging this in. But apart from that, is there anything more you'd like me to check here? Do you approve of the use of importlib.resources here?
ERROR_CLASSES_JSON = '''
%s
'''
ERRORS_DIR = Path(__file__).parents[1]
Just to confirm, does it work both when you pip install pyspark and when you download it from the Apache Spark channel (https://spark.apache.org/downloads.html)?
I don't know if this particular line will work when PySpark is packaged for distribution, but that's OK because _write_self() is meant for use by developers who are writing to the JSON file during development. Right?

I don't think we want to use importlib.resources here because that's for loading resources from a potentially read-only volume, which may be the case when PySpark is installed from a ZIP file, for example. Since this is a development tool, we need a functioning filesystem with write access, so __file__ will work fine.
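For context, here is a minimal sketch of a development-only helper along the lines of `_write_self()`, assembled from the hunks shown in this review; the module location and exact serialization details are assumptions, not the verbatim implementation.

```python
import json
from pathlib import Path

# Assumption: this helper lives in pyspark/errors/exceptions/__init__.py, so
# parents[1] resolves to pyspark/errors/, which holds error-conditions.json.
ERRORS_DIR = Path(__file__).parents[1]


def _write_self() -> None:
    """Regenerate error-conditions.json from the in-memory error map.

    Development-only: it assumes a writable source checkout, which is why a
    plain __file__-based path is fine here.
    """
    from pyspark.errors.error_classes import ERROR_CLASSES_MAP

    with open(ERRORS_DIR / "error-conditions.json", "w") as f:
        json.dump(ERROR_CLASSES_MAP, f, sort_keys=True, indent=2)
        f.write("\n")
```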
# Note: When we drop support for Python 3.8, we should migrate from importlib.resources.read_text()
# to importlib.resources.files().joinpath().read_text().
# See: https://docs.python.org/3/library/importlib.resources.html#importlib.resources.open_text
ERROR_CLASSES_JSON = importlib.resources.read_text("pyspark.errors", "error-conditions.json")
@HyukjinKwon - I think you are actually concerned about this line here, right?
I updated the PR description with the tests I ran to confirm that the JSON file gets packaged correctly and loaded at runtime.
You asked about installing PySpark via pip vs. downloading it from the site. I only tested pip, but both should work because importlib.resources uses the same access pattern that import does.
Is there some additional test I can run to cover the scenarios you are concerned about?
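For reference, a minimal sketch of the loading pattern being discussed, including the Python 3.9+ form mentioned in the code comment above; this is illustrative rather than the exact PySpark code.

```python
import importlib.resources
import json

# importlib.resources resolves the file through the same machinery as a
# regular import, so this works for pip installs and ZIP-based installs alike.
ERROR_CLASSES_JSON = importlib.resources.read_text("pyspark.errors", "error-conditions.json")
ERROR_CLASSES_MAP = json.loads(ERROR_CLASSES_JSON)

# Python 3.9+ alternative mentioned in the code comment:
ERROR_CLASSES_JSON = (
    importlib.resources.files("pyspark.errors")
    .joinpath("error-conditions.json")
    .read_text()
)
```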
Manual tests should be fine. I just want to be cautious here .. because it's the very entry point of pyspark .... if we go wrong here, we would need another release :-).
I agree, it's sensitive. We definitely don't want to precipitate an emergency release.
I am happy to volunteer for pre-release testing whenever the next release candidate for Spark 4.0 comes out.
Oh okay here
Converting to draft until SPARK-46810 is resolved.
We have agreed in SPARK-46810 to rename "error class" to "error condition", so this PR is unblocked since we know we won't need to rename the new error-conditions.json file. The work to rename all instances of "error class" to "error condition" across the board will happen in SPARK-46810 and SPARK-47429. I would like to keep this PR focused on simply moving the Python error conditions into a JSON file.

@HyukjinKwon - I believe this PR is ready to go. Do you have any outstanding concerns?
Just to make sure, does it work when you install PySpark as a ZIP file? e.g., downloading it from https://spark.apache.org/downloads.html would install PySpark as a ZIP file.
I've updated the PR description with this additional ZIP test (test 4). Just to confirm, the ZIP that gets uploaded to the site is the one under …?
It's the one …
Hmm, so when people install this ZIP, how exactly do they do it? Because it does not install cleanly like the ZIP under …
People can actually directly use PySpark by just importing it.
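A rough sketch of that usage, mirroring test 5 in the PR description; the SPARK_HOME path below is a placeholder, not a real location.

```python
import glob
import sys

# Placeholder path to an unpacked Spark distribution from spark.apache.org;
# its python/lib/ directory ships pyspark.zip plus the py4j ZIP it depends on.
SPARK_HOME = "/opt/spark"
sys.path[:0] = glob.glob(f"{SPARK_HOME}/python/lib/*.zip")

from pyspark.errors.error_classes import ERROR_CLASSES_MAP

# The JSON file is read straight out of pyspark.zip via importlib.resources.
print(len(ERROR_CLASSES_MAP))
```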
@HyukjinKwon - Anything else you'd like to see done here?
Friendly ping @HyukjinKwon.
ERROR_CLASSES_MAP = json.loads(ERROR_CLASSES_JSON)
""" % json.dumps(
error_classes.ERROR_CLASSES_MAP, sort_keys=True, indent=2
with open(ERRORS_DIR / "error-conditions.json", "w") as f:
@nchammas I believe it worked because you built Spark at the project root directory, and the ERRORS_DIR directory exists ...
We should read error-conditions.json from pyspark.zip .. and that's the real problem ..
Hmm, I don't understand the concern. This method here is _write_self(). It's for development only. No user will run this when they install Spark, regardless of the installation method. That's what I was saying in my earlier comment on this method.

The real code path we care about is in error_classes.py, not __init__.py. And this is the code path that I tested in various ways and documented in the PR description.
I tested the ZIP installation method you were particularly concerned about in point 5 of the PR description.
Is there something about that test you think is inadequate?
Oh oops
Alright, LGTM with one comment: #44920 (comment). Sorry, it was my bad. Let's see how it goes!
Merged to master.
### What changes were proposed in this pull request?

Minor fixes to the English of some comments I added in #44920.

### Why are the changes needed?

Proper English -- OK, not _proper_, but more correct at least -- makes things easier to read.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Not tested beyond CI.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46321 from nchammas/error-cond-typo.

Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
global-exclude *.py[cod] __pycache__ .DS_Store
# Reference: https://setuptools.pypa.io/en/latest/userguide/miscellaneous.html

graft pyspark
@nchammas Seems like this ended up adding all the tests as well. Could we just include that JSON file alone?
Yes, graft pulls everything.
We can try to just include what we think we need, but it's probably safer (and easier) in the long run to instead exclude what we don't want to package, like tests. Would that work for you?
I agree that it's safer so we don't miss something out ... but let's just add the JSON file alone .. I think it's more important to get rid of unrelated files ..
OK. Are you planning to address this in #46331 (or some other PR), or would you like me to take care of it?
I would appreciate if you make another PR :-)
…o PyPI package

### What changes were proposed in this pull request?

This PR is a followup of #44920 that includes `error-conditions.json` into the PyPI package.

### Why are the changes needed?

So PyPI packages work with pip install. https://github.com/apache/spark/actions/runs/8913716987/job/24479781481 is broken for now.

### Does this PR introduce _any_ user-facing change?

No, the main change has not been released yet.

### How was this patch tested?

Manually tested the packaging, and checked that the file exists in the package.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46328 from HyukjinKwon/SPARK-46894-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

Change the Python manifest so that tests are excluded from the packages that are built for distribution.

### Why are the changes needed?

Tests were unintentionally included in the distributions as part of #44920. See [this comment](https://github.com/apache/spark/pull/44920/files#r1586979834).

### Does this PR introduce _any_ user-facing change?

No, since #44920 hasn't been released to any users yet.

### How was this patch tested?

I built Python packages and inspected `SOURCES.txt` to confirm that tests were excluded:

```sh
cd python
rm -rf pyspark.egg-info || echo "No existing egg info file, skipping deletion"
python3 packaging/classic/setup.py sdist
python3 packaging/connect/setup.py sdist
find dist -name '*.tar.gz' | xargs -I _ tar xf _ --directory=dist
cd ..
open python/dist
find python/dist -name SOURCES.txt | xargs code
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46354 from nchammas/SPARK-48107-package-json.

Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
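As a quick illustration of the packaging checks described in the follow-ups above, here is a rough Python sketch that inspects a locally built sdist for the JSON file and for stray test modules; the sdist path and version below are assumptions, not part of any of these PRs.

```python
import tarfile

# Assumption: an sdist built locally, e.g. via packaging/classic/setup.py sdist;
# adjust the filename and version to match your build.
SDIST = "python/dist/pyspark-4.0.0.dev0.tar.gz"

with tarfile.open(SDIST) as tar:
    names = tar.getnames()

# The error-conditions JSON should be shipped with the package ...
assert any(n.endswith("pyspark/errors/error-conditions.json") for n in names)

# ... while test modules should stay out of the distribution.
assert not any("/tests/" in n for n in names), "tests leaked into the sdist"
```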
…SON file

### What changes were proposed in this pull request?

Move PySpark error conditions into a standalone JSON file.

Adjust the instructions in `MANIFEST.in` so they package the new JSON file correctly.

### Why are the changes needed?

Having the JSON in its own file enables better IDE support for editing and managing the JSON. For example, VS Code will detect duplicate keys automatically and underline them.

This change also simplifies the logic to regenerate the JSON.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

1. Existing unit tests.

2. I confirmed that rewriting the JSON file works as intended:

```python
from pyspark.errors.exceptions import _write_self

_write_self()
```

3. I built a PySpark distribution, installed it, and confirmed the error class test still works.

```sh
trash dist/
./dev/make-distribution.sh --pip -DskipTests -Phive

# The JSON file is put in the dist/ directory.
find dist -name 'error-conditions.json'
# dist/python/pyspark/errors/error-conditions.json

# It's also listed in SOURCES.txt.
grep error-conditions.json dist/python/pyspark.egg-info/SOURCES.txt
# pyspark/errors/error-conditions.json

cd dist/python
python -m venv venv
source venv/bin/activate
pip install .

# It also gets installed into the newly created venv.
find venv -name error-conditions.json
# venv/lib/python3.11/site-packages/pyspark/errors/error-conditions.json

# Running the test from inside the venv also succeeds.
python pyspark/errors/tests/test_errors.py
# Ran 3 tests in 0.000s
#
# OK
```

4. I repeated test 3, but this time used the generated ZIP to install PySpark in a fresh virtual environment:

```sh
$ cd .../spark/
$ trash dist
$ ./dev/make-distribution.sh --pip -DskipTests -Phive
$ cd .../Desktop/
$ python -m venv venv
$ source venv/bin/activate
$ pip install .../spark/python/dist/pyspark-4.0.0.dev0.tar.gz
$ find venv -name error-conditions.json
venv/lib/python3.11/site-packages/pyspark/errors/error-conditions.json
$ python
Python 3.11.7 (main, Dec 11 2023, 02:35:11) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyspark.errors.error_classes import ERROR_CLASSES_MAP
>>> len(ERROR_CLASSES_MAP)
222
```

You can see that the JSON file was packaged correctly and installed into the virtual environment, and `ERROR_CLASSES_MAP` loaded correctly as well.

5. I installed PySpark by including a different built ZIP in the `PYTHONPATH` as [described here][pyzip]:

```sh
$ cd .../spark/
$ trash dist
$ ./dev/make-distribution.sh --pip -DskipTests -Phive
$ cd .../Desktop/
$ export SPARK_HOME=".../spark"
$ export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
$ python
Python 3.11.7 (main, Dec 11 2023, 02:35:11) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyspark.errors.error_classes import ERROR_CLASSES_MAP
>>> len(ERROR_CLASSES_MAP)
222
```

[pyzip]: https://spark.apache.org/docs/3.5.1/api/python/getting_started/install.html#manually-downloading

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#44920 from nchammas/pyspark-error-json.

Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?
Move PySpark error conditions into a standalone JSON file.
Adjust the instructions in MANIFEST.in so they package the new JSON file correctly.

Why are the changes needed?
Having the JSON in its own file enables better IDE support for editing and managing the JSON. For example, VS Code will detect duplicate keys automatically and underline them.
This change also simplifies the logic to regenerate the JSON.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing unit tests.
I confirmed that rewriting the JSON file works as intended:
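As in the commit message above, the rewrite can be exercised with:

```python
from pyspark.errors.exceptions import _write_self

_write_self()
```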
I built a PySpark distribution, installed it, and confirmed the error class test still works.
I repeated test 3 (building a distribution), but this time used the generated ZIP to install PySpark in a fresh virtual environment (see the commit message above for the full commands).
You can see that the JSON file was packaged correctly and installed into the virtual environment, and ERROR_CLASSES_MAP loaded correctly as well.

I installed PySpark by including a different built ZIP in the PYTHONPATH, as described here: https://spark.apache.org/docs/3.5.1/api/python/getting_started/install.html#manually-downloading

Was this patch authored or co-authored using generative AI tooling?
No.