[SPARK-26803][PYTHON] Add sbin subdirectory to pyspark #23715
Conversation
The sbin/ bits aren't used by end users; they're used on the cluster deployment side for standalone mode. I think that's why they're not included. sbin/ is of course included in the Spark distro. This affects the pyspark package, right? Which would be for end users on the client side.
My use case is wanting to run the history server.
Those are in the binary or source distributions at http://spark.apache.org/downloads.html; that's what you want if you're running Spark. The pyspark package is for calling Spark from an app.
I'm using Spark within a Python script. The logical thing then is to install Spark through conda or pip, and indeed that's one of the options listed on the download page. This hasn't given me any problems so far, and I would expect this to be a natural workflow for other users.
The pyspark pip package is the client side... I don't even think it would have the Spark binaries to run, like, the history server. I don't think pip is a good way to distribute all that. It would be 100s of MB of JAR files that almost no app actually needs. You're meant to run Spark (which is more than the Pyspark wrapper) by getting a distribution of, well, all of Spark. EDIT: I looked at my pyspark install and it does actually have a bunch of JARs, hm. I presume this is for running in local mode, but here I'm out of my depth. Maybe that's good news, if you really mean to just run Spark locally. But yeah, for managing a standalone cluster or the history server, that's really beyond pyspark and you'd need a full distribution.
The pyspark pip package is over 200 MB and includes JARs. I'm not sure whether that's all the JARs, or what exactly you mean by 'client side', but it's enough to run Spark locally. And like I said, the history server works fine if I manually download the sbin scripts.
The history server is useful in local mode as well. When you start out, you typically don't have a cluster to experiment on. And you don't know whether you're writing Spark jobs correctly, so you want to look at how they are executed. And for that you need the history server, because the regular Spark UI disappears as soon as your job is finished. Frankly, I don't know how else users are supposed to gain an understanding of what they are doing.
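To make the local workflow concrete: with a pip-installed pyspark (e.g. after `pip install pyspark`) you can turn on event logging so a finished run can still be inspected afterwards. The configuration keys below are standard Spark settings; the application and the log directory are just illustrative choices.

```python
import os
from pyspark.sql import SparkSession

# Example log directory; any local path works, but it has to exist up front.
event_log_dir = "/tmp/spark-events"
os.makedirs(event_log_dir, exist_ok=True)

# Write event logs so the history server can replay the UI after the app exits.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("history-server-demo")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "file://" + event_log_dir)
    .getOrCreate()
)

spark.range(1000000).selectExpr("sum(id) AS total").show()
spark.stop()
```

A history server pointed at the same directory (file:/tmp/spark-events is the default for spark.history.fs.logDirectory) then serves the replayed UI on http://localhost:18080.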
I basically agree with @srowen's point. The PySpark installation should basically focus on the Python client side. Frankly, I don't think it even makes sense that a package on PyPI has non-Python-related stuff.
BTW, I know there are already non-Python files in the package, for instance the JARs.
The client/server side distinction isn't relevant when you run Spark locally. The pyspark package contains everything needed to do that. I get the impression your real issue is with what's already contained in the pyspark package. If you want to change that, that's another discussion. It is not a valid argument against this PR. This PR adds 41KB of scripts to the pyspark package to actually make use of functionality that's already there. Have a look at https://spark.apache.org/docs/latest/monitoring.html and tell me how a user is supposed to tell this doesn't apply to pyspark.
It is a valid argument, since you're referring to what's already contained in the PySpark package. The current argument can be a subset of that argument. Practically I understand it can be useful. But ...
It's valid if you submit an alternative PR that takes out everything you feel doesn't belong in the pyspark package, and argue that that supersedes this request. Otherwise your objection is impossible to argue against.
If you look at @holdenk's PR that created the pip package, you can see that the intention was to include all of Spark that's required to run Spark locally. That is the principle that should guide the decision whether to add the history server script. The history server is more than just 'useful'. It's essential for everyone new to Spark who hasn't a clue how their job is actually executed and whether they wrote their script right.
That doesn't necessarily mean it gets fixed right away. Getting rid of the non-PySpark stuff would break the previous use cases, and this factor should also be considered. I am saying we should stop doing this and hold off if we see this isn't the right way. For clarification, it doesn't mean we should do something only because the existing package already does it.
I am saying the intention should be re-discussed. I couldn't foresee a request like this. PyPI is for Python packages. CRAN is for R packages.
The history server is optional. People can still use PySpark, no?
Let's keep the discussion short - it doesn't necessarily mean we have to fix it. It looks like adding the scripts can be practically useful, but I'm not sure this is the right thing to do in principle. If this is a one-time thing, I won't stand against it, if other committers prefer it as well.
Yep, I'm pretty sure now that the pip package has all those JARs so you can run locally, so 'pyspark' does anything at all without a cluster. Local mode is just for testing and experiments; Spark doesn't have much point on one machine for anything 'real'. I don't think running a history server with local execution is a reasonable use case as a result. Maybe more to the point: those scripts don't work without a Spark distro; they expect SPARK_HOME to be set or to be run from within an unpacked distribution. Maybe you'll shock me again by saying it really does happen to work with the pip package layout too, but I've never understood that to be supported or the intent, something that's being maintained. Does the script even work when packaged this way?
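On the SPARK_HOME point: the pip package resolves its own installation directory through an internal helper, which is presumably how the bundled bin/ scripts and JARs are located. A minimal sketch, assuming the pyspark.find_spark_home module that ships in the pip package (an internal API, so treat this as illustrative only):

```python
# Locate the Spark installation that a pip install of pyspark provides.
from pyspark.find_spark_home import _find_spark_home

spark_home = _find_spark_home()
print(spark_home)  # e.g. .../site-packages/pyspark

# The sbin/ scripts discussed here would live under spark_home + "/sbin"
# once packaged, alongside the existing bin/ and jars/ directories.
```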
I think the problem is that you and other developers have been using Spark for years and know exactly what you are doing. For someone new to Spark, everything starts with testing and experiments and trying to understand how jobs are executed. For that, the history server is essential. Otherwise Spark is just a black box, right? Let me give you a concrete example. I am implementing an algorithm which unfortunately involves a near-cross join. Using the history server I could see how it was actually being executed.
Yes, that's what I've been trying to tell you. I've manually downloaded the sbin scripts and they work with the pip-installed package.
Can you just download the distribution from the Spark website if you need other stuff than pyspark?
The Spark UI is always available while running. You can run the history server if you want, with local mode, though I think that's rare; you just need the actual Spark distribution. I'm surprised it just works from the pip packaging, but that makes this a reasonable idea, to package the scripts that happen to work. They're small. I'm still not sure it's a good idea, as it's not on purpose and there is a well-established and intended way to run Spark, which is to get the Spark distro.
The Spark UI disappears as soon as your app run is complete, so in practice it's really impractical for evaluation.
You say that that's the "well-established and intended way to run Spark", but both the download page and the quickstart guide explicitly give you the option of installing pyspark as a pip package. I suspect that most Python developers will jump to that option, because it's a one-line installation, because it fits their workflow, and because they are working with conda or virtualenv environments. I think there is a communication issue here from Spark to its users, where on the one hand Spark offers a pip package that can in essence do everything, locally, but on the other hand you seem to be saying well actually, that package is not the right way to run Spark.
I am not seeing what argument you're making here. PyPI has Python packages. It should focus on PySpark - users expect to install a Python package from PyPI. It happened to have some unrelated files, or some mistakes. Yes, we should improve and clean up the documentation in that case. This is the baseline. Now, we (at least I) are considering this given that you're saying this might be practically helpful, although it looks rather like a misuse of the pip package. I don't think we should allow users to run ...
I think this is going around in circles. You expect pip to be a way to install all of Spark, so you read the download page as advertising that. Pyspark is not Spark, just as SparkR isn't either. These are client libraries. I'm surprised it works as much as it does as a Spark distro! But I have never heard this expectation otherwise, and you are arguing with the people who publish this over what the intent of publishing it is. I am not going to merge this but don't strongly object if someone else does. There is a practical argument for adding it even if in theory it isn't meant to be there.
I agree this is going a bit in circles now. I appreciate both of you taking the time to engage in this discussion and recognising that this has practical value. Let me just add that there are people on Stack Overflow being told that pyspark contains all of Spark, while explicitly wondering why the sbin scripts are missing.
So I think the original intent of the packaging is less important than supporting the use cases that folks are trying to cover. I'm hesitant to package more scripts mostly because we don't have any particular testing for them to make sure they will work when packaged in this way. That being said, if we could put in a small test which makes sure the history server can be started, I can see how it would be useful for local debugging with PySpark and it certainly shouldn't do too much harm. Perhaps we could also find a way to make the scripts print a warning in this situation. What does everyone think?
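A rough sketch of the kind of smoke test suggested here, assuming the packaged script would end up at &lt;SPARK_HOME&gt;/sbin/start-history-server.sh and that the server answers on its default port 18080; this is illustrative only, not an actual test in the Spark tree:

```python
import os
import subprocess
import time
import unittest
import urllib.request

from pyspark.find_spark_home import _find_spark_home


class HistoryServerSmokeTest(unittest.TestCase):
    def test_history_server_starts(self):
        spark_home = _find_spark_home()
        start = os.path.join(spark_home, "sbin", "start-history-server.sh")
        stop = os.path.join(spark_home, "sbin", "stop-history-server.sh")
        self.assertTrue(os.path.exists(start), "sbin scripts are not packaged")

        # The history server refuses to start if its log directory is missing;
        # file:/tmp/spark-events is the default spark.history.fs.logDirectory.
        os.makedirs("/tmp/spark-events", exist_ok=True)

        subprocess.check_call([start])
        try:
            # Poll the default history server port until the UI responds.
            for _ in range(30):
                try:
                    urllib.request.urlopen("http://localhost:18080", timeout=1)
                    break
                except Exception:
                    time.sleep(1)
            else:
                self.fail("history server did not come up on port 18080")
        finally:
            subprocess.check_call([stop])


if __name__ == "__main__":
    unittest.main()
```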
I'm not against it, as it happens to work. A warning would be nice if possible. Do the other scripts actually work, not just the history server ones? Maybe we could restrict it to what is known to happen to work.
I think all the other scripts specifically deal with standalone clusters. It seems cleaner to me to just build pyspark with the whole sbin directory. Re tests: is there a test for the UI server which we could mimic?
Maybe it's conservative to just include the history server start/stop scripts, and spark-config.sh? Yes, the others are about running a standalone cluster.
@oulenz up to you, we can proceed as per my last comment if that sounds good to you.
Yes, that works for me. I've pushed a new version that only includes the history server start/stop scripts and spark-config.sh.
Test build #4574 has finished for PR 23715 at commit
Merged to master |
What changes were proposed in this pull request?
Modifies setup.py so that the sbin subdirectory is included in pyspark.
How was this patch tested?
Manually tested with Python 2.7 and Python 3.7.
```
$ ./build/mvn -D skipTests -P hive -P hive-thriftserver -P yarn -P mesos clean package
$ cd python
$ python setup.py sdist
$ pip install dist/pyspark-2.1.0.dev0.tar.gz
```
Checked manually that sbin is now present in the install directory.
@srowen @holdenk
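For readers wondering what such a packaging change roughly looks like: python/setup.py already stages directories like bin/ and jars/ from the Spark tree and lists them as package data, and including sbin/ follows the same pattern. The snippet below is an illustrative sketch of that pattern, not the actual diff; the variable names and the file list (taken from the scripts mentioned in the discussion above) are assumptions.

```python
# Illustrative sketch only - not the actual setup.py change.
# setup.py already handles bin/ and jars/ this way; sbin/ is analogous.
import os

SPARK_HOME = os.path.abspath("../")
TEMP_PATH = "deps"  # staging directory used while building the sdist
SBIN_PATH = os.path.join(SPARK_HOME, "sbin")
SBIN_TARGET = os.path.join(TEMP_PATH, "sbin")

# setup(...) would then gain entries along these lines:
extra_packages = ["pyspark.sbin"]
extra_package_dir = {"pyspark.sbin": SBIN_TARGET}
extra_package_data = {
    "pyspark.sbin": [
        "spark-config.sh",
        "start-history-server.sh",
        "stop-history-server.sh",
    ]
}
```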