[SPARK-1267][PYSPARK] Adds pip installer for pyspark #8318
python/pyspark/pyspark_version.py (new file):

@@ -0,0 +1,17 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
__version__ = '1.5.0'
Contributor
Is there a way to source this from some existing place? That way we don't have to update the version string in multiple places. I forget where, but there should already be a central place where the version is set.
Author
I'm not seeing any version that's specific to pyspark, only a version for Spark as a whole. I agree that we don't want to set a version in multiple places, but I think the one I introduced is the only version unique to pyspark.

An alternative, but trickier, idea would be to make Maven's pom.xml version the authoritative one and have the build process add or modify this file to match it (maybe using Maven resource filtering?). This would break being able to just "pip install -e python" in development mode, since people would have to remember to run the Maven command to sync the file over, but at least there would be no risk of the versions going out of sync in the build.
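To make that trade-off concrete, a build-time sync along those lines might look roughly like the sketch below. This is not part of the PR; the POM path, output path, and helper names are assumptions for illustration.

```python
# Hypothetical sketch (not part of this PR): treat Maven's pom.xml as the single
# source of truth and regenerate pyspark_version.py from it at build/packaging time.
# File paths below are assumptions.
import xml.etree.ElementTree as ET

POM_VERSION_TAG = "{http://maven.apache.org/POM/4.0.0}version"

def read_pom_version(pom_path="pom.xml"):
    # Look for a <version> element that is a direct child of <project>;
    # the parent POM's version is nested under <parent> and is not matched here.
    root = ET.parse(pom_path).getroot()
    version = root.find(POM_VERSION_TAG)
    if version is None:
        raise RuntimeError("No <version> element found in %s" % pom_path)
    return version.text.strip()

def write_version_file(version, out_path="python/pyspark/pyspark_version.py"):
    # Regenerate the Python-side version file from the Maven version.
    with open(out_path, "w") as f:
        f.write("__version__ = %r\n" % version)

if __name__ == "__main__":
    write_version_file(read_pom_version())
```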
Author
I'm not sure I entirely follow. Are you suggesting that when Spark is built, Maven creates this pyspark_version file as part of the build process? If so, how does this affect a user who installs from PyPI?

We still need to build an sdist and a wheel, so we can just make sure that whatever process we use adds that file in. I'm not sure it's really worth the complexity at this moment, but my team does something similar internally: our Python and Java code both get semantic versions based off of the latest tag and the git hash.
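Purely for illustration, a tag-plus-hash scheme like the one mentioned could be sketched as follows; the tag format and the resulting version layout are assumptions, not Spark's actual versioning.

```python
# Illustrative only: derive a PEP 440-style version from the latest git tag and
# commit hash, e.g. "1.5.0.dev3+ga288923". Assumes tags are named like "v1.5.0".
import subprocess

def git_version():
    # "git describe --tags --long" yields e.g. "v1.5.0-3-ga288923":
    # the nearest tag, the number of commits since it, and the short hash.
    described = subprocess.check_output(
        ["git", "describe", "--tags", "--long"]).decode().strip()
    tag, commits_since, short_hash = described.rsplit("-", 2)
    base = tag.lstrip("v")
    if commits_since == "0":
        return base  # exactly on a release tag
    return "%s.dev%s+%s" % (base, commits_since, short_hash)
```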
Contributor
I think it's error-prone to have multiple copies of the version in different places; if someone forgets to update one of them, PySpark will break (even within the repo). I'd vote for generating the version while generating the PyPI package. If PySpark comes along with Spark, we don't need this check (at the very least it shouldn't fail or slow things down).
Author
So we remove the version checks entirely in the bundled version, and include them only for the package uploaded to PyPI? I agree that this reduces the chance for maintainer error, but I'm worried about users upgrading versions of Spark. A user could install a bundled version of pyspark and then later point their SPARK_HOME at a newer version of Spark. There would then be a version mismatch that wouldn't be detected.

How is the version number specified for the Scala side now?
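For context on the kind of mismatch check being discussed, a rough sketch follows. Discovering the installed Spark's version by reading the RELEASE file under SPARK_HOME is an assumption for the sketch, not necessarily how this PR performs its check.

```python
# Hypothetical sketch of a pip-side sanity check: compare the pip-installed
# pyspark version with the Spark installation that SPARK_HOME points at.
# Reading the version out of the RELEASE file is an assumption for illustration.
import os
import re

from pyspark.pyspark_version import __version__ as pip_version

def installed_spark_version(spark_home=None):
    spark_home = spark_home or os.environ.get("SPARK_HOME")
    if not spark_home:
        return None  # nothing to compare against
    release_file = os.path.join(spark_home, "RELEASE")
    if not os.path.exists(release_file):
        return None
    with open(release_file) as f:
        match = re.search(r"Spark (\S+)", f.read())
    return match.group(1) if match else None

def warn_on_mismatch():
    spark_version = installed_spark_version()
    if spark_version and spark_version != pip_version:
        print("Warning: pip-installed pyspark %s does not match Spark %s at SPARK_HOME"
              % (pip_version, spark_version))
```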
Author
I'm not sure. Could someone with more experience with that side of the project chime in?

I am in favor of pyspark packaging the corresponding version of Spark. As a user experience, this is cleaner, requires fewer steps, and is more natural and in line with other pip-installable libraries. I have experience packaging jars with Python libraries in platform-independent ways and would be happy to help if wanted.
python/setup.py (new file):

@@ -0,0 +1,22 @@
#!/usr/bin/env python

from setuptools import setup

exec(compile(open("pyspark/pyspark_version.py").read(),
             "pyspark/pyspark_version.py", 'exec'))
VERSION = __version__

setup(name='pyspark',
      version=VERSION,
      description='Apache Spark Python API',
      author='Spark Developers',
      author_email='dev@spark.apache.org',
      url='https://github.com/apache/spark/tree/master/python',
      packages=['pyspark', 'pyspark.mllib', 'pyspark.ml', 'pyspark.sql', 'pyspark.streaming'],
      install_requires=['py4j==0.9'],
      extras_require = {
          'ml': ['numpy>=1.7'],
          'sql': ['pandas']
      },
      license='http://www.apache.org/licenses/LICENSE-2.0',
      )
|
Contributor
This is maybe asking for too much, but in Sparkling Pandas we install our own assembly jar*; would it maybe make sense to do that as part of this process? (*Getting it working has been painful, but doable.)
Author
I'm not familiar with assembly jars, so please correct me if I'm wrong, but I think we shouldn't need one for pyspark, as it is entirely Python code. Wouldn't we only need an assembly jar if we were also looking to package Scala or Java code?
Contributor
So by assembly JAR in this case I'd be referring to the Spark assembly jar (which we would want to package as an artifact along with the submit scripts if we wanted to put this on PyPI, but that might not be an immediate goal).
Author
So if SPARK_HOME were set, it would use that Spark installation, and default to the packaged JAR otherwise? Depending on the size of the assembly JAR, I'd be in favor of this, as it makes installation very easy for those who only want to interact with Spark through pyspark, but the discussion on the mailing list seemed to intentionally shy away from too large a PyPI package. I'll bring up your suggestion to see if there's wider support, and I encourage you to join the discussion here: http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-on-PyPi-td12626.html
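The resolution order being described, prefer an existing SPARK_HOME and fall back to a bundled jar, might look roughly like the sketch below; the bundled-jar location inside the package is an invented placeholder, not something this PR ships.

```python
# Hypothetical sketch of "use SPARK_HOME if set, otherwise fall back to a jar
# bundled with the pip package". The bundled location is a made-up placeholder.
import glob
import os

def find_spark_assembly(spark_home=None):
    spark_home = spark_home or os.environ.get("SPARK_HOME")
    if spark_home:
        # Prefer the user's existing Spark installation when one is configured.
        jars = glob.glob(os.path.join(spark_home, "lib", "spark-assembly-*.jar"))
        if jars:
            return jars[0]
    # Otherwise fall back to a jar shipped inside the installed package
    # (placeholder path for the sketch).
    bundled = glob.glob(os.path.join(os.path.dirname(__file__),
                                     "jars", "spark-assembly-*.jar"))
    return bundled[0] if bundled else None
```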
Author
As was discussed on the list, I think it makes sense to hold off on the jar at first. It's definitely worth revisiting down the line, though.
Since we will always need this branch, can we remove the other one (i.e., always find the version from the assembly jar)?
@alope107 Can we go ahead and find the version only from the spark-assembly jar?
@alope107, would you mind updating this PR to remove the pom_xml_file_path branch? Thanks!
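For readers following along, one plausible way to find the version only from the spark-assembly jar is to parse it out of the jar's file name; this is an assumption for illustration, since the PR's actual check isn't shown in this excerpt.

```python
# Hypothetical illustration of "find the version only from the spark-assembly jar":
# parse it from the jar's file name, e.g. spark-assembly-1.5.0-hadoop2.6.0.jar.
import glob
import os
import re

def version_from_assembly_jar(spark_home):
    pattern = os.path.join(spark_home, "lib", "spark-assembly-*.jar")
    for jar in glob.glob(pattern):
        # Capture the leading dotted version number after "spark-assembly-".
        match = re.match(r"spark-assembly-([\d.]+\d)", os.path.basename(jar))
        if match:
            return match.group(1)
    return None
```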