
Conversation

@oulenz
Contributor

oulenz commented Jan 31, 2019

What changes were proposed in this pull request?

Modifies setup.py so that sbin subdirectory is included in pyspark
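
For context, a minimal, hypothetical sketch of how a directory of shell scripts can be bundled with a Python package via setuptools; the actual Spark setup.py stages directories into a temporary build tree before packaging, and the names and paths below are illustrative rather than the real diff:

```python
# Hypothetical sketch only: ship a directory of shell scripts as package data
# so it ends up inside the installed pyspark tree. Not the actual Spark diff.
from setuptools import setup

setup(
    name="pyspark-sbin-demo",                    # placeholder project name
    version="0.0.1",
    packages=["pyspark", "pyspark.sbin"],        # sbin exposed as a data-only "package"
    package_dir={
        "pyspark": "pyspark",
        "pyspark.sbin": "deps/sbin",             # hypothetical staged copy of sbin/
    },
    package_data={
        "pyspark.sbin": ["*.sh"],                # include the shell scripts themselves
    },
)
```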

How was this patch tested?

Manually tested with python 2.7 and python 3.7

$ ./build/mvn -D skipTests -P hive -P hive-thriftserver -P yarn -P mesos clean package
$ cd python
$ python setup.py sdist
$ pip install dist/pyspark-2.1.0.dev0.tar.gz

Checked manually that sbin is now present in install directory.

@srowen @holdenk

@srowen
Member

srowen commented Jan 31, 2019

The sbin/ bits aren't used by end users; they're used on the cluster deployment side for standalone mode. I think that's why they're not included. sbin/ is of course included in the Spark distro. This affects the pyspark package, right? That would be for end users on the client side.

@oulenz
Contributor Author

oulenz commented Jan 31, 2019

My use case is wanting to run start-history-server.sh as per https://spark.apache.org/docs/latest/monitoring.html. I ended up cloning the spark repo and copying the sbin subdirectory manually, which was not intuitive. I'm not sure whether there's a preferred alternative?
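
For illustration, a minimal sketch (not taken from the monitoring docs or this PR) of a local run whose event log the history server could then serve, assuming event logging is enabled and the log directory /tmp/spark-events already exists:

```python
# Sketch: write an event log from a local PySpark run so that a separately
# started history server (sbin/start-history-server.sh) can display it later.
# The log directory below is an arbitrary choice for illustration.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("history-server-demo")
    .config("spark.eventLog.enabled", "true")                   # record the application's events
    .config("spark.eventLog.dir", "file:///tmp/spark-events")   # directory must exist beforehand
    .getOrCreate()
)

df = spark.range(1000)
df.selectExpr("id % 10 AS bucket").groupBy("bucket").count().show()
spark.stop()
```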

@srowen
Member

srowen commented Jan 31, 2019

Those are in the binary or source distributions at http://spark.apache.org/downloads.html ; that's what you want if you're running Spark. The pyspark package is for calling Spark from an app.

@oulenz
Contributor Author

oulenz commented Jan 31, 2019

I'm using Spark within a python script. The logical thing then is to install spark through conda or pip, and indeed that's one of the options listed on the download page. This hasn't given me any problems so far, and I would expect this to be a natural workflow for other users.

@srowen
Member

srowen commented Jan 31, 2019

The pyspark pip package is the client side... I don't even think it would have the Spark binaries to run, like the history server. I don't think pip is a good way to distribute all that. It would be 100s of MB of JAR files that almost no app actually needs.

You're meant to run Spark (which is more than the Pyspark wrapper) by getting a distribution of, well, all of Spark.

EDIT: I looked at my pyspark install and it does actually have a bunch of JARs, hm. I presume this is for running in local mode, but here I'm out of my depth. Maybe that's good news, if you really mean to just run Spark locally. But yeah, for managing a standalone cluster or the history server, that's really beyond pyspark and you'd need a full distribution.

@oulenz
Contributor Author

oulenz commented Jan 31, 2019

The pyspark pip package is over 200 MB and includes JARs. I'm not sure whether that's all JARs, and what exactly you mean by 'client side', but it's enough to run spark locally. And like I said, the history server works fine if I manually download the sbin subdirectory and run start-history-server.sh.

@oulenz
Contributor Author

oulenz commented Jan 31, 2019

The history server is useful in local mode as well. When you start out, you typically don't have a cluster to experiment on. And you don't know whether you're writing spark jobs correctly, so you want to look at how they are executed. And for that you need the history server, because the regular Spark ui disappears as soon as your job is finished. Frankly, I don't know how else users are supposed to gain an understanding of what they are doing.

@HyukjinKwon
Member

I basically agree with @srowen's point. The PySpark installation should basically focus on the Python client side. start-history-server.sh isn't even part of the Python library.

Frankly, I don't think it even makes sense that a package on PyPI contains non-Python-related stuff.

@HyukjinKwon
Member

BTW, I know there are already exceptions, for instance the sparkR shell under bin, but I even doubt whether it makes sense to keep that there.

@oulenz
Contributor Author

oulenz commented Feb 1, 2019

The client/server side distinction isn't relevant when you run spark locally. The pyspark package contains everything needed to spark-submit a job locally. It automatically starts the ui at localhost:4040. It already contains everything needed to launch the history server, except the actual script.

I get the impression your real issue is with what's already contained in the pyspark package. If you want to change that, that's another discussion. It is not a valid argument against this PR.

This PR adds 41KB of scripts to the pyspark package to actually make use of functionality that's already there. Have a look at https://spark.apache.org/docs/latest/monitoring.html and tell me how a user is supposed to tell this doesn't apply to pyspark.

@HyukjinKwon
Member

I get the impression your real issue is with what's already contained in the pyspark package. If you want to change that, that's another discussion. It is not a valid argument against this PR

It is a valid argument, since you're referring to what's already contained in the PySpark package. The current argument can be seen as a subset of that one.

Practically, I understand it can be useful. But pip install pyspark is not Spark. It looks like we're heading in a weird direction. I can see doing this as a one-time thing if other committers prefer, but it looks like something we shouldn't do in principle. Adding @felixcheung and @holdenk to ask for more opinions.

@oulenz
Contributor Author

oulenz commented Feb 1, 2019

I get the impression your real issue is with what's already contained in the pyspark package. If you want to change that, that's another discussion. It is not a valid argument against this PR

It is a valid argument, since you're referring to what's already contained in the PySpark package. The current argument can be seen as a subset of that one.

It's valid if you submit an alternative PR that takes out everything you feel doesn't belong in the pyspark package, and argue that that supersedes this request. Otherwise your objection is impossible to argue against.

Practically, I understand it can be useful. But pip install pyspark is not Spark. It looks like we're heading in a weird direction. I can see doing this as a one-time thing if other committers prefer, but it looks like something we shouldn't do in principle. Adding @felixcheung and @holdenk to ask for more opinions.

If you look at @holdenk's pr that created the pip package, you can see that the intention was to include all of spark that's required to run spark locally. That is the principle that should guide the decision whether to add the history server script.

The history server is more than just 'useful'. It's essential for everyone new to spark who hasn't a clue how their job is actually executed and whether they wrote their script right.

@HyukjinKwon
Member

HyukjinKwon commented Feb 1, 2019

It's valid if you submit an alternative PR that takes out everything you feel doesn't belong in the pyspark package, and argue that that supersedes this request. Otherwise your objection is impossible to argue against.

It doesn't necessarily have to be fixed right away. Removing the non-PySpark stuff would break existing use cases, even if it is the right thing to do, and that factor should also be considered. I am saying we should stop and hold off if we see that this isn't the right way.

For clarification, that doesn't mean we should do something only because the existing packaging already does it that way.

If you look at @holdenk's pr that created the pip package, you can see that the intention was to include all of spark that's required to run spark locally. That is the principle that should guide the decision whether to add the history server script.

I am saying the intention should be re-discussed. I couldn't have foreseen a request like this. PyPI is for Python packages; CRAN is for R packages.

The history server is more than just 'useful'. It's essential for everyone new to spark who hasn't a clue how their job is actually executed and whether they wrote their script right.

The history server is optional. People can still use PySpark, no?

@HyukjinKwon
Member

Let's keep the discussion short - it doesn't necessarily mean this has to be fixed. It looks like adding the scripts can be practically useful, but I'm not sure it's the right thing to do in principle.

If this is a one-time thing, I won't stand against it, if other committers prefer it as well.

@srowen
Member

srowen commented Feb 1, 2019

Yep I'm pretty sure now that the pip package has all those jars so you can run locally, so 'pyspark' does anything at all without a cluster. Local mode is just for testing and experiments; Spark doesn't have much point on one machine for anything 'real'. I don't think running a history server with local execution is a reasonable use case as a result.

Maybe more to the point: those scripts don't work without a Spark distro; they expect SPARK_HOME to be set or to be run from within an unpacked distribution. Maybe you'll shock me again by saying it really does happen to work with the pip package layout too, but I've never understood that to be supported or the intent, something that's being maintained.

Does the script even work when packaged this way?

@oulenz
Contributor Author

oulenz commented Feb 2, 2019

Yep I'm pretty sure now that the pip package has all those jars so you can run locally, so 'pyspark' does anything at all without a cluster. Local mode is just for testing and experiments; Spark doesn't have much point on one machine for anything 'real'. I don't think running a history server with local execution is a reasonable use case as a result.

I think the problem is that you and other developers have been using Spark for years and know exactly what you are doing. For someone new to Spark, everything starts with testing and experiments and trying to understand how jobs are executed. For that, the history server is essential. Otherwise Spark is just a black box, right?

Let me give you a concrete example. I am implementing an algorithm which unfortunately involves a near-cross join. Using the history server I

  • could confirm that the large join was indeed the stage that was taking most of the time
  • discovered that Spark was using only a few large tasks for the join, which were waiting on each other, so I had to repartition the tables involved
  • realised that Spark throws away dataframes once they are used, so the computationally intensive join was actually calculated twice, and so I learned about caching (a rough sketch of this pattern follows below)
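
A rough, hypothetical sketch of the pattern those last two points describe; the table sizes, partition count, and column names are made up for illustration:

```python
# Sketch: spread a heavy join over more partitions and cache its result so a
# second action does not recompute the expensive join. Illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

left = spark.range(1_000_000).withColumnRenamed("id", "k")
right = spark.range(1_000_000).withColumnRenamed("id", "k")

joined = (
    left.repartition(200, "k")   # more, smaller tasks instead of a few large ones
        .join(right, "k")
        .cache()                 # keep the join result instead of recomputing it
)

print(joined.count())            # first action materialises and caches the join
print(joined.count())            # second action reads from the cache
spark.stop()
```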

Maybe more to the point: those scripts don't work without a Spark distro; they expect SPARK_HOME to be set or to be run from within an unpacked distribution. Maybe you'll shock me again by saying it really does happen to work with the pip package layout too, but I've never understood that to be supported or the intent, something that's being maintained.

Does the script even work when packaged this way?

Yes that's what I've been trying to tell you. I've manually downloaded the sbin folder into my pyspark folder and have been using the history server no problemo. Otherwise I wouldn't be proposing this PR.
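
For anyone reproducing that workaround, a small sketch of how to find the pip-installed pyspark directory that the scripts' SPARK_HOME would need to point at; relying on this layout is an assumption rather than something documented as supported:

```python
import os
import pyspark

# The pip-installed package directory contains bin/, jars/ and (with this PR)
# the packaged sbin scripts. Treating it as SPARK_HOME is an assumption about
# the pip layout, not a documented guarantee.
spark_home = os.path.dirname(pyspark.__file__)
os.environ["SPARK_HOME"] = spark_home
print(spark_home)
```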

@HyukjinKwon
Member

Can you just download the distribution from the Spark website if you need other stuff than pyspark?

@srowen
Member

srowen commented Feb 2, 2019

The Spark UI is always available while running. You can run the history server if you want, with local mode, though I think that's rare; you just need the actual Spark distribution.

I'm surprised it just works from the pip packaging, but that makes this a reasonable idea, to package the scripts that happen to work. They're small. I'm still not sure it's a good idea, as it's not on purpose and there is a well-established and intended way to run Spark, which is to get the Spark distro.

@oulenz
Contributor Author

oulenz commented Feb 2, 2019

The Spark UI is always available while running. You can run the history server if you want, with local mode, though I think that's rare; you just need the actual Spark distribution.

The Spark UI disappears as soon as your app run is complete, so in practice it's really impractical for evaluation.

I'm surprised it just works from the pip packaging, but that makes this a reasonable idea, to package the scripts that happen to work. They're small. I'm still not sure it's a good idea, as it's not on purpose and there is a well-established and intended way to run Spark, which is to get the Spark distro.

You say that that's the "well-established and intended way to run Spark", but both the download page and the quickstart guide explicitly give you the option of installing pyspark as a pip package. I suspect that most Python developers will jump to that option, because it's a one-line installation, because it fits their workflow, and because they are working with conda or virtualenv environments.

I think there is a communication issue here from Spark to its users, where at the one hand Spark offers a pip package that can in essence do everything, locally, but on the other hand you seem to be saying well actually, that package is not the right way to run Spark.

@HyukjinKwon
Member

HyukjinKwon commented Feb 2, 2019

I am not seeing what argument you're making here. PyPI has Python packages; it should focus on PySpark - users expect to install a Python package from PyPI. It happens to have some unrelated files, or some mistakes. Yes, we should improve and clean up the documentation in that case. This is the baseline.

Now, we (at least I) are considering this given that you're saying it might be practically helpful, although it looks rather like a misuse of pip install pyspark.

I don't think we should allow users to run pip install pyspark on each node to run a Spark cluster.

@srowen
Member

srowen commented Feb 2, 2019

I think this is going around in circles. You expect pip to be a way to install all of Spark, so you read the download page as advertising that. Pyspark is not Spark, just as SparkR isn't either. These are client libraries. I'm surprised it works as much as it does as a Spark distro! But I have never heard this expectation otherwise, and you are arguing with the people who publish this over what the intent of publishing it is.

I am not going to merge this but don't strongly object if someone else does. There is a practical argument for adding it even if in theory it isn't meant to be there.

@oulenz
Contributor Author

oulenz commented Feb 3, 2019

I agree this is going a bit in circles now. I appreciate you both taking the time to engage in this discussion and recognising that this has practical value.

Let me just add that there are people on Stack Overflow being told that pyspark contains all of Spark, while explicitly wondering why the sbin folder isn't included. And that there's been a PR to add the sbin scripts to the conda package of pyspark.

@holdenk
Contributor

holdenk commented Feb 8, 2019

So I think the original intent of the packaging is less important than supporting the use cases that folks are trying to achieve.

I'm hesitant to package more scripts mostly because we don't have any particular testing for them to make sure they will work when packaged in this way. That being said, if we could add a small test which makes sure the history server can be started, I can see how it would be useful for local debugging with PySpark, and it certainly shouldn't do too much harm.

Perhaps we could also find a way to make "This Python packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos) - but does not contain the tools required to set up your own standalone Spark cluster. You can download the full version of Spark from the [Apache Spark downloads page](http://spark.apache.org/downloads.html)." print as a warning for the sbin scripts packaged on PyPI, so folks are less likely to use them incorrectly.

What does everyone think?

@srowen
Member

srowen commented Feb 8, 2019

I'm not against it, as it happens to work. A warning would be nice if possible. Do the other scripts actually work, not just the history server ones? Maybe we could restrict it to what is known to work.

@oulenz
Contributor Author

oulenz commented Feb 8, 2019

I think all the other scripts specifically deal with standalone clusters. It seems cleaner to me to just build pyspark with the whole sbin folder, but it's the same to me if you want to exclude everything but the start/stop history server scripts.

Re tests: is there a test for the UI server which we could mimic?

@srowen
Member

srowen commented Feb 8, 2019

Maybe it's conservative to just include the history server start/stop scripts and spark-config.sh? Yes, the others are about running a standalone cluster.

@srowen
Member

srowen commented Feb 14, 2019

@oulenz up to you, we can proceed as per my last comment if that sounds good to you

@oulenz
Contributor Author

oulenz commented Feb 26, 2019

Yes, that works for me. I've pushed a new version that only includes start-history-server.sh, stop-history-server.sh, spark-config.sh and spark-daemon.sh.

@SparkQA

SparkQA commented Feb 26, 2019

Test build #4574 has finished for PR 23715 at commit 4185ce6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Feb 27, 2019

Merged to master
