
[SPARK-46894][PYTHON] Move PySpark error conditions into standalone JSON file #44920

Closed
nchammas wants to merge 11 commits into master from nchammas:pyspark-error-json

Conversation

nchammas
Contributor

@nchammas nchammas commented Jan 28, 2024

What changes were proposed in this pull request?

Move PySpark error conditions into a standalone JSON file.

Adjust the instructions in MANIFEST.in so they package the new JSON file correctly.

Why are the changes needed?

Having the JSON in its own file enables better IDE support for editing and managing the JSON. For example, VS Code will detect duplicate keys automatically and underline them.

This change also simplifies the logic to regenerate the JSON.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

1. Existing unit tests.
2. I confirmed that rewriting the JSON file works as intended:

    ```python
    from pyspark.errors.exceptions import _write_self
    _write_self()
    ```

3. I built a PySpark distribution, installed it, and confirmed the error class test still works.

    ```sh
    trash dist/
    ./dev/make-distribution.sh --pip -DskipTests -Phive

    # The JSON file is put in the dist/ directory.
    find dist -name 'error-conditions.json'
    # dist/python/pyspark/errors/error-conditions.json

    # It's also listed in SOURCES.txt.
    grep error-conditions.json dist/python/pyspark.egg-info/SOURCES.txt
    # pyspark/errors/error-conditions.json

    cd dist/python
    python -m venv venv
    source venv/bin/activate
    pip install .

    # It also gets installed into the newly created venv.
    find venv -name error-conditions.json
    # venv/lib/python3.11/site-packages/pyspark/errors/error-conditions.json

    # Running the test from inside the venv also succeeds.
    python pyspark/errors/tests/test_errors.py
    # Ran 3 tests in 0.000s
    #
    # OK
    ```

4. I repeated test 3, but this time used the generated ZIP to install PySpark in a fresh virtual environment:

    ```sh
    $ cd .../spark/
    $ trash dist
    $ ./dev/make-distribution.sh --pip -DskipTests -Phive
    $ cd .../Desktop/
    $ python -m venv venv
    $ source venv/bin/activate
    $ pip install .../spark/python/dist/pyspark-4.0.0.dev0.tar.gz
    $ find venv -name error-conditions.json
    venv/lib/python3.11/site-packages/pyspark/errors/error-conditions.json
    $ python
    Python 3.11.7 (main, Dec 11 2023, 02:35:11) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from pyspark.errors.error_classes import ERROR_CLASSES_MAP
    >>> len(ERROR_CLASSES_MAP)
    222
    ```

    You can see that the JSON file was packaged correctly and installed into the virtual environment, and `ERROR_CLASSES_MAP` loaded correctly as well.

5. I installed PySpark by including a different built ZIP in the `PYTHONPATH` as [described here][pyzip]:

    ```sh
    $ cd .../spark/
    $ trash dist
    $ ./dev/make-distribution.sh --pip -DskipTests -Phive
    $ cd .../Desktop/
    $ export SPARK_HOME=".../spark"
    $ export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
    $ python
    Python 3.11.7 (main, Dec 11 2023, 02:35:11) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from pyspark.errors.error_classes import ERROR_CLASSES_MAP
    >>> len(ERROR_CLASSES_MAP)
    222
    ```

[pyzip]: https://spark.apache.org/docs/3.5.1/api/python/getting_started/install.html#manually-downloading

Was this patch authored or co-authored using generative AI tooling?

No.

@@ -0,0 +1,1096 @@
{
Member

The problem is that it has to be packaged together and be uploadable to PyPI. It would be great if we can make sure that still works.

Contributor Author

Yes, good call out. I was looking at MANIFEST.in and I believe this file should be included, but I will confirm.

Contributor Author

Confirmed. I've updated the PR description accordingly.

Comment on lines 22 to 25
# Note that though we call them "error classes" here, the proper name is "error conditions",
# hence why the name of the JSON file different.
# For more information, please see: https://issues.apache.org/jira/browse/SPARK-46810
ERROR_CONDITIONS_PATH = THIS_DIR / "error-conditions.json"
Contributor Author

This comment is related to the work being done in #44902, by the way.

@HyukjinKwon
Member

@itholic can you take a look please? I remember you took a look and failed to find a good way to upload JSON.

@itholic
Contributor

itholic commented Jan 29, 2024

IIRC there was no major issue with managing the JSON itself. However, since we cannot integrate with the error-classes.json file on the JVM side (because we didn't want to have a JVM dependency), we simply adopted a .py file, which is a more convenient way to manage this in Python. (That said, we couldn't see any benefit to using JSON instead of a .py file at that time.)

So I agree with changing to a JSON file if the advantage of using a JSON file over a .py file is clear, and if there are no issues with packaging. Also, you might need to take a deeper look at the documentation. For example, we point to the .py file path from Error classes in PySpark.

@nchammas
Contributor Author

nchammas commented Jan 29, 2024

since we cannot integrate with the error-classes.json file on the JVM side

This is for a separate potential PR, but if it were possible to use the "main" error JSON files from Scala in PySpark automatically, would we want to do that? I see that PySpark's errors don't define a SQLSTATE, so I assumed they were a separate thing and we didn't want to reuse the main error definitions.

So I agree with changing to a JSON file if the advantage of using a JSON file over a .py file is clear, and if there are no issues with packaging. Also, you might need to take a deeper look at the documentation. For example, we point to the .py file path from Error classes in PySpark.

Is this a reference to this command?

generate_errors_doc(output_rst_file_path)

Anyway, just FYI I am in the middle of revamping the main error documentation to make it easier to maintain. First I am cleaning up the terminology in #44902 but then I will be consolidating the various sql-error-* pages. I will tag you in those PRs when I open them.

@nchammas
Contributor Author

The build is still running, but the pyspark-core tests are passing. I believe importlib.resources is what we need to load data files packaged in the distribution we upload to PyPI.
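
For reference, the loading pattern in question looks roughly like this (a sketch; the package and file names are the ones proposed in this PR):

```python
import importlib.resources
import json

# Resolve the packaged JSON the same way the import system resolves modules,
# so it works for a regular install, a wheel, or a ZIP on sys.path.
ERROR_CLASSES_JSON = importlib.resources.read_text("pyspark.errors", "error-conditions.json")
ERROR_CLASSES_MAP = json.loads(ERROR_CLASSES_JSON)
```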

@itholic
Contributor

itholic commented Jan 29, 2024

This is for a separate potential PR, but if it were possible to use the "main" error JSON files from Scala in PySpark automatically, would we want to do that?

I don't think so. As I recall, the main reason for not doing it was that, as you said, the error structure on the PySpark side is different from the error structure on the JVM side.

I will be consolidating the various sql-error-* pages. I will tag you in those PRs when I open them.

+1

@itholic
Contributor

itholic commented Jan 29, 2024

Is this a reference to this command?

Yes, so you might need to fix the description at https://github.com/apache/spark/blob/master/python/pyspark/errors_doc_gen.py#L44.

This is a list of common, named error classes returned by PySpark which are defined at `error_classes.py <https://github.com/apache/spark/blob/master/python/pyspark/errors/error_classes.py>`_.

@github-actions github-actions bot added the CORE label Jan 29, 2024
@nchammas
Contributor Author

I think we should wait for the conversation in SPARK-46810 to resolve before merging this in.

But apart from that, is there anything more you'd like me to check here? Do you approve of the use of importlib.resources (which I think is the "correct" solution in our case)?

ERROR_CLASSES_JSON = '''
%s
'''
ERRORS_DIR = Path(__file__).parents[1]
Member

Just to confirm: does this work both when you pip install pyspark and when you download it from the Apache Spark download channel (https://spark.apache.org/downloads.html)?

Contributor Author

I don't know if this particular line will work when PySpark is packaged for distribution, but that's OK because _write_self() is meant for use by developers who are writing to the JSON file during development. Right?

I don't think we want to use importlib.resources here because that's for loading resources from a potentially read-only volume, which may be the case when PySpark is installed from a ZIP file, for example. Since this is a development tool, we need a functioning filesystem with write access, so __file__ will work fine.
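
To make that concrete, a development-only regeneration helper along these lines might look roughly as follows (an illustrative sketch; only the `__file__`-relative directory, the target file name, and the `sort_keys`/`indent` arguments come from the diff):

```python
import json
from pathlib import Path

# Assumes this lives under pyspark/errors/exceptions/, as in this PR,
# so parents[1] points at pyspark/errors/.
ERRORS_DIR = Path(__file__).parents[1]


def _write_self() -> None:
    """Rewrite error-conditions.json from the in-memory map (development only)."""
    from pyspark.errors import error_classes

    # Requires a writable source checkout; installed PySpark never runs this.
    with open(ERRORS_DIR / "error-conditions.json", "w") as f:
        json.dump(error_classes.ERROR_CLASSES_MAP, f, sort_keys=True, indent=2)
        f.write("\n")
```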


# Note: When we drop support for Python 3.8, we should migrate from importlib.resources.read_text()
# to importlib.resources.files().joinpath().read_text().
# See: https://docs.python.org/3/library/importlib.resources.html#importlib.resources.open_text
ERROR_CLASSES_JSON = importlib.resources.read_text("pyspark.errors", "error-conditions.json")
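
For reference, the post-Python-3.8 form mentioned in that comment would look roughly like this (a sketch, not code from this PR):

```python
import importlib.resources
import json

# Python 3.9+: files() returns a Traversable, so the resource can be addressed
# as a path-like object instead of via the (package, filename) pair.
ERROR_CLASSES_JSON = (
    importlib.resources.files("pyspark.errors")
    .joinpath("error-conditions.json")
    .read_text()
)
ERROR_CLASSES_MAP = json.loads(ERROR_CLASSES_JSON)
```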
Contributor Author

@HyukjinKwon - I think you are actually concerned about this line here, right?

I updated the PR description with the tests I ran to confirm that the JSON file gets packaged correctly and loaded at runtime.

You asked about installing PySpark via pip vs. downloading it from the site. I only tested pip, but both should work because importlib.resources uses the same access pattern that import does.

Is there some additional test I can run to cover the scenarios you are concerned about?

Member

Manual tests should be fine. I just want to be cautious here .. because it's the very entry point of pyspark .... if we go wrong here, we would need another release :-).

Contributor Author

I agree, it's sensitive. We definitely don't want to precipitate an emergency release.

I am happy to volunteer for pre-release testing whenever the next release candidate for Spark 4.0 comes out.

Member

Oh okay here

@nchammas nchammas marked this pull request as draft January 30, 2024 05:21
@nchammas
Contributor Author

Converting to draft until SPARK-46810 is resolved.

@nchammas nchammas marked this pull request as ready for review March 24, 2024 16:21
@nchammas
Contributor Author

nchammas commented Mar 24, 2024

We have agreed in SPARK-46810 to rename "error class" to "error condition", so this PR is unblocked since we know we won't need to rename the new error-conditions.json file.

The work to rename all instances of "error class" to "error condition" across the board will happen in SPARK-46810 and SPARK-47429. I would like to keep this PR focused on simply moving the Python error conditions into a JSON file.

@HyukjinKwon - I believe this PR is ready to go. Do you have any outstanding concerns?

@HyukjinKwon
Member

Just to make sure, does it work when you install PySpark as a ZIP file? e.g., downloading it from https://spark.apache.org/downloads.html would install PySpark as a ZIP file.

@nchammas
Contributor Author

I've updated the PR description with this additional ZIP test (test 4).

Just to confirm, the ZIP that gets uploaded to the site is the one under python/dist/. Is that correct?

@HyukjinKwon
Member

It's the one at python/lib/pyspark.zip once you finish building.

@nchammas
Contributor Author

Hmm, so when people install this ZIP how exactly do they do it? Because it does not install cleanly like the ZIP under python/dist/.

$ pip install .../spark/python/lib/pyspark.zip
Processing .../spark/python/lib/pyspark.zip
ERROR: file:///.../spark/python/lib/pyspark.zip does not appear to be a Python project:
  neither 'setup.py' nor 'pyproject.toml' found.

@HyukjinKwon
Member

People can actually directly use PySpark via importing pyspark.zip, see https://spark.apache.org/docs/latest/api/python/getting_started/install.html?highlight=pythonpath#manually-downloading
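
In other words, Python's import system can load packages directly from a ZIP archive on `sys.path`, which is what the `PYTHONPATH` approach in test 5 of the PR description relies on. A minimal sketch with placeholder paths:

```python
import glob
import sys

# Placeholder location of a built Spark distribution; adjust to your checkout.
SPARK_HOME = "/path/to/spark"

# Put every ZIP under python/lib/ (pyspark.zip, py4j-*.zip) on sys.path,
# mirroring the PYTHONPATH export from test 5.
sys.path[:0] = sorted(glob.glob(f"{SPARK_HOME}/python/lib/*.zip"))

from pyspark.errors.error_classes import ERROR_CLASSES_MAP

# The JSON resource is read from inside pyspark.zip via importlib.resources.
print(len(ERROR_CLASSES_MAP))
```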

@github-actions github-actions bot added the DOCS label Mar 29, 2024
@nchammas
Contributor Author

nchammas commented Mar 29, 2024

OK, I tested that as well and updated the PR description accordingly.

I also tweaked the syntax highlighting for that bit of documentation you linked to, because it was off. This is how it currently looks:

[Screenshot: 2024-03-29 at 11.30.26 AM]

Note the weird italicization and missing *.

@nchammas
Contributor Author

nchammas commented Apr 8, 2024

@HyukjinKwon - Anything else you'd like to see done here?

@nchammas
Contributor Author

Friendly ping @HyukjinKwon.

ERROR_CLASSES_MAP = json.loads(ERROR_CLASSES_JSON)
""" % json.dumps(
error_classes.ERROR_CLASSES_MAP, sort_keys=True, indent=2
with open(ERRORS_DIR / "error-conditions.json", "w") as f:
Member

@nchammas I believe it worked because you built Spark at the project root directory, and the ERRORS_DIR directory exists ...

Member

We should read error-conditions.json from pyspark.zip .. and that's the real problem ..

Contributor Author

Hmm, I don't understand the concern. This method here is _write_self(). It's for development only. No user will run this when they install Spark, regardless of the installation method. That's what I was saying in my earlier comment on this method.

The real code path we care about is in error_classes.py, not __init__.py. And this is the code path that I tested in various ways and documented in the PR description.

I tested the zip installation method you were particularly concerned about in point 5:

[Screenshot: 2024-04-30 at 12.12.25 AM]

Is there something about that test you think is inadequate?

Member

Oh oops

Member

@HyukjinKwon HyukjinKwon left a comment

Alright, LGTM with one comment: #44920 (comment). Sorry, it was my bad. Let's see how it goes!

@HyukjinKwon
Member

Merged to master.

@nchammas nchammas deleted the pyspark-error-json branch May 1, 2024 05:39
HyukjinKwon pushed a commit that referenced this pull request May 1, 2024
### What changes were proposed in this pull request?

Minor fixes to the English of some comments I added in #44920.

### Why are the changes needed?

Proper English -- OK, not _proper_, but more correct at least -- makes things easier to read.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Not tested beyond CI.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46321 from nchammas/error-cond-typo.

Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
global-exclude *.py[cod] __pycache__ .DS_Store
# Reference: https://setuptools.pypa.io/en/latest/userguide/miscellaneous.html

graft pyspark
Member

@nchammas Seems like this ended up adding all the tests as well. Could we just include that json file alone?

Contributor Author

Yes, graft pulls everything.

We can try to just include what we think we need, but it's probably safer (and easier) in the long run to instead exclude what we don't want to package, like tests. Would that work for you?

Member

I agree that it's safer so we don't miss something out ... but let's just add the json file alone .. I think it's more important to get rid of unrelated files ..

Contributor Author

OK. Are you planning to address this in #46331 (or some other PR), or would you like me to take care of it?

Member

I would appreciate it if you could make another PR :-)

HyukjinKwon added a commit that referenced this pull request May 2, 2024
…o PyPI package

### What changes were proposed in this pull request?

This PR is a followup of #44920 that includes `error-conditions.json` in the PyPI package.

### Why are the changes needed?

So PyPI packages work with pip install. https://github.com/apache/spark/actions/runs/8913716987/job/24479781481 is broken for now.

### Does this PR introduce _any_ user-facing change?

No, the main change has not been released yet.

### How was this patch tested?

Manually tested the packaging, and checked that the file exists in the package.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46328 from HyukjinKwon/SPARK-46894-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request May 3, 2024
### What changes were proposed in this pull request?

Change the Python manifest so that tests are excluded from the packages that are built for distribution.

### Why are the changes needed?

Tests were unintentionally included in the distributions as part of #44920. See [this comment](https://github.com/apache/spark/pull/44920/files#r1586979834).

### Does this PR introduce _any_ user-facing change?

No, since #44920 hasn't been released to any users yet.

### How was this patch tested?

I built Python packages and inspected `SOURCES.txt` to confirm that tests were excluded:

```sh
cd python
rm -rf pyspark.egg-info || echo "No existing egg info file, skipping deletion"
python3 packaging/classic/setup.py sdist
python3 packaging/connect/setup.py sdist
find dist -name '*.tar.gz' | xargs -I _ tar xf _ --directory=dist
cd ..
open python/dist
find python/dist -name SOURCES.txt | xargs code
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46354 from nchammas/SPARK-48107-package-json.

Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>