47 changes: 47 additions & 0 deletions python/pyspark/__init__.py
@@ -36,6 +36,53 @@
Finer-grained cache persistence levels.

"""
import os
import re
import sys

from os.path import isfile, join

import xml.etree.ElementTree as ET

if os.environ.get("SPARK_HOME") is None:
    raise ImportError("Environment variable SPARK_HOME is undefined.")

spark_home = os.environ['SPARK_HOME']
pom_xml_file_path = join(spark_home, 'pom.xml')
snapshot_version = None

if isfile(pom_xml_file_path):
    try:
        tree = ET.parse(pom_xml_file_path)
        root = tree.getroot()
        # The <version> element is expected to be the fifth child of <project>;
        # keep only the numeric prefix, e.g. "1.5.0" from "1.5.0-SNAPSHOT".
        version_tag = root[4].text
        snapshot_version = version_tag[:5]
    except Exception:
        raise ImportError("Could not read the spark version, because pom.xml file" +
                          " could not be read.")
else:
    try:
        lib_file_path = join(spark_home, "lib")
Contributor: Since we will always need this branch, can we remove the other one (i.e. always find the version from the assembly jar)?

Contributor: @alope107 Can we go ahead and find the version only from the spark-assembly jar?

Reviewer: @alope107, would you mind updating this PR to remove the pom_xml_file_path branch? Thanks!

        jars = [f for f in os.listdir(lib_file_path) if isfile(join(lib_file_path, f))]

        # Extract the version from jar names of the form spark-assembly-<version>*.jar.
        for jar in jars:
            m = re.match(r"^spark-assembly-([0-9\.]+).*\.jar$", jar)
            if m is not None and len(m.groups()) > 0:
                snapshot_version = m.group(1)

        if snapshot_version is None:
            raise ImportError("Could not read the spark version, because pom.xml or spark" +
                              " assembly jar could not be found.")
    except OSError:
        raise ImportError("Could not read the spark version, because pom.xml or lib directory" +
                          " could not be found in SPARK_HOME")


from pyspark.pyspark_version import __version__
if snapshot_version != __version__:
    raise ImportError("Incompatible version of Spark (%s) and PySpark (%s)." %
                      (snapshot_version, __version__))


from pyspark.conf import SparkConf
from pyspark.context import SparkContext
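For reference, a minimal sketch of the jar-only approach suggested in the review comments above (drop the pom.xml branch and always derive the version from the spark-assembly jar). It reuses the PR's own regex but is only an illustration, not code from this PR:

```python
# Hypothetical simplification (not part of this PR): always read the Spark
# version from the spark-assembly jar under $SPARK_HOME/lib.
import os
import re
from os.path import isdir, join


def _version_from_assembly(spark_home):
    lib_dir = join(spark_home, "lib")
    if not isdir(lib_dir):
        raise ImportError("lib directory could not be found in SPARK_HOME")
    for jar in os.listdir(lib_dir):
        m = re.match(r"^spark-assembly-([0-9\.]+).*\.jar$", jar)
        if m:
            return m.group(1)
    raise ImportError("spark-assembly jar could not be found in SPARK_HOME/lib")
```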
17 changes: 17 additions & 0 deletions python/pyspark/pyspark_version.py
@@ -0,0 +1,17 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
__version__ = '1.5.0'
Contributor: Is there a way to source this from some existing place? That way we don't have to update the version string in multiple places. I forget where, but there should already be a central place where the version is set.

Author: I'm not seeing any version that's specific to PySpark, only a version for Spark as a whole. I agree that we don't want to set a version in multiple places, but I think the one I introduced is the only version unique to PySpark.

Reviewer: An alternative, but trickier, idea would be to make mvn's pom.xml version the authoritative one and have the build process add or modify this file to match it (maybe using mvn resource filtering?). This would break being able to just "pip install -e python" in development mode, since people would have to remember to run the mvn command to sync the file over, but at least there would be no risk of the versions going out of sync in the build.

Author: I'm not sure I entirely follow. Are you suggesting that when Spark is built, Maven creates this pyspark_version file as part of the build process? If so, how does this affect a user who installs from PyPI?

Reviewer: We still need to build an sdist and wheel, so we can just make sure that whatever process we use adds that file in. I'm not sure it's really worth the complexity at this moment, but my team does something internally such that our Python and Java code both get semantic versions based off the latest tag and the git hash.

Contributor: I think it's error-prone to have multiple copies of the version in different places; if someone forgets to update one of them, PySpark will break (even within the repo).

I'd vote for generating the version while generating the PyPI package. If PySpark comes along with Spark, we don't need this check (at least it shouldn't fail or slow things down).

Author: So we remove the version checks entirely in the bundled version, and include them only for the package uploaded to PyPI? I agree that this reduces the chance of maintainer error, but I'm worried about users upgrading versions of Spark. A user could install a bundled version of PySpark and then later point their SPARK_HOME at a newer version of Spark. There would then be a version mismatch that wouldn't be detected.

Maybe a middle ground could be to include the version checks in both bundled and pip installations, but to add a check during PyPI package generation that the version has been properly set.

Reviewer: How is the version number specified for the Scala side now?

Author: I'm not sure. Could someone with more experience with that side of the project chime in?

Reviewer: I am in favor of PySpark packaging the corresponding version of Spark. As a user experience, this is cleaner, requires fewer steps, and is more natural and in line with other pip-installable libraries. I have experience packaging jars with Python libraries in platform-independent ways and would be happy to help if wanted.
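To make the "generate the version during packaging" idea above concrete, here is a rough sketch of a helper that a release or sdist-building script could run. The element lookup and file paths are assumptions for illustration; nothing like this is implemented in the PR:

```python
# Hypothetical helper (not part of this PR): derive pyspark/pyspark_version.py
# from the top-level pom.xml so the version string is maintained in one place.
import xml.etree.ElementTree as ET

POM_NS = "{http://maven.apache.org/POM/4.0.0}"


def write_pyspark_version(pom_path="../pom.xml",
                          out_path="pyspark/pyspark_version.py"):
    root = ET.parse(pom_path).getroot()
    # e.g. "1.5.0-SNAPSHOT" -> "1.5.0"
    version = root.find(POM_NS + "version").text.split("-")[0]
    with open(out_path, "w") as f:
        f.write("__version__ = '%s'\n" % version)


if __name__ == "__main__":
    write_pyspark_version()
```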

22 changes: 22 additions & 0 deletions python/setup.py
@@ -0,0 +1,22 @@
#!/usr/bin/env python

from setuptools import setup

# Read __version__ from pyspark/pyspark_version.py without importing pyspark
# itself (which would require SPARK_HOME to be set).
exec(compile(open("pyspark/pyspark_version.py").read(),
             "pyspark/pyspark_version.py", 'exec'))
VERSION = __version__

setup(name='pyspark',
      version=VERSION,
      description='Apache Spark Python API',
      author='Spark Developers',
      author_email='dev@spark.apache.org',
      url='https://github.com/apache/spark/tree/master/python',
      packages=['pyspark', 'pyspark.mllib', 'pyspark.ml', 'pyspark.sql', 'pyspark.streaming'],
      install_requires=['py4j==0.9'],
      extras_require={
          'ml': ['numpy>=1.7'],
          'sql': ['pandas']
      },
      license='http://www.apache.org/licenses/LICENSE-2.0')
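As a usage note for the extras declared above: from the python/ directory, `pip install -e .` installs PySpark with only py4j as a required dependency, while `pip install -e ".[ml]"` additionally pulls in numpy and `pip install -e ".[sql]"` pulls in pandas.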
Contributor: This is maybe asking for too much, but in Sparkling Pandas we install our own assembly jar*; would it maybe make sense to do that as part of this process?

(*and getting it working has been painful, but doable)

Author: I'm not familiar with assembly jars, so please correct me if I'm wrong, but I think we shouldn't need one for PySpark, as it is entirely Python code. Wouldn't we only need an assembly jar if we were also looking to package Scala or Java code?

Contributor: By assembly JAR in this case I'm referring to the Spark assembly jar (which we would want to package as an artifact, along with the submit scripts, if we wanted to put this on PyPI, but that might not be an immediate goal).

Author: So if SPARK_HOME were set, it would use that Spark installation, and default to the packaged JAR otherwise? Depending on the size of the assembly JAR, I would be in favor of this, as it makes installation very easy for those who only want to interact with Spark through PySpark, but the discussion on the mailing list seemed to intentionally shy away from too large a PyPI package. I'll bring up your suggestion to see if there's wider support, and I encourage you to join the discussion here: http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-on-PyPi-td12626.html

Author: As was discussed on the list, I think it makes sense to hold off on the jar at first. It's definitely worth revisiting down the line, though.
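For illustration of the fallback described above (use SPARK_HOME when set, otherwise a jar bundled with the pip package), here is a rough sketch. The bundled-jar location is a placeholder, since no jar is actually packaged by this PR:

```python
# Hypothetical sketch (not in this PR): prefer an explicit SPARK_HOME and fall
# back to an assembly jar bundled inside the pyspark package. The bundled path
# ("jars/spark-assembly.jar") is a placeholder, not an agreed-upon layout.
import os
from os.path import dirname, exists, isdir, join


def find_spark_assembly():
    spark_home = os.environ.get("SPARK_HOME")
    if spark_home:
        lib_dir = join(spark_home, "lib")
        if isdir(lib_dir):
            for name in sorted(os.listdir(lib_dir)):
                if name.startswith("spark-assembly-") and name.endswith(".jar"):
                    return join(lib_dir, name)
    bundled = join(dirname(__file__), "jars", "spark-assembly.jar")
    return bundled if exists(bundled) else None
```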