[ZEPPELIN-1332] Remove spark-dependencies & suggest new way #1339
Conversation
@AhyoungRyu great initiative, but while making these changes, you also have to think about the CI use case of the Zeppelin build. If you ask me, I would say that before doing such big changes as refactoring the build structure, we all need a very clear understanding and explanation of … So far I have not understood the answer to the questions above from the PR description (may be my fault), but having it would at least let me vote for such a change. If the goal is reducing the convenience binary size, then we need to know how much the size changes with the proposed changes, to understand whether it is worth it. If it impacts CI build times, we also need to know by how much. Also, regarding user experience while running … Please take it with a grain of salt, and of course I will be happy to help address each item one by one.
@bzz Thank you for such a precise comment! Let me break down your feedback one by one (just to make it clear) :)
Right, that's my bad. I'll change the dir to another one. Then how about 2, 3, and 4?
Actually I also tried to describe the current problem & the advantages of this change in the Jira issue and the PR description, but I guess I didn't do it well enough. I should've explained more clearly. Let me explain here with actual numbers. (I'll update the Jira & PR description as well.)
As you said above, yes, the main problem is the Zeppelin binary package size. The latest Zeppelin binary was … in size. Didn't we have to ask the ASF infra team(?) every release because of Zeppelin's huge package size?
When I created the binary package without …, … As you can see above, the size difference between those two cases is about …
So far, I just added the line below to show in the console after users start … Then it starts downloading the Spark binary from the mirror site. I'm planning to add some description to the README, as we already provide a lot of build-profile information there. I also agree there must be a better way to notify users than just writing "We will download a 100MB Spark binary package if you don't set SPARK_HOME yet" in the README.
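For illustration only, the notice-then-download flow described above could look roughly like the sketch below. The message text, mirror URL, and paths here are hypothetical, not the actual lines from this PR:

```bash
# Hypothetical sketch of the console notice and download step; the real
# message, mirror URL, and paths in this PR may differ.
if [[ -z "${SPARK_HOME}" ]]; then
  echo "SPARK_HOME is not set. Downloading a Spark binary (~100MB) for local mode..."
  mkdir -p "${ZEPPELIN_HOME}/local-spark"
  curl -L -o "${ZEPPELIN_HOME}/local-spark/spark.tgz" \
    "https://archive.apache.org/dist/spark/spark-2.0.1/spark-2.0.1-bin-hadoop2.7.tgz"
  tar -xzf "${ZEPPELIN_HOME}/local-spark/spark.tgz" -C "${ZEPPELIN_HOME}/local-spark"
fi
```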
As you can see in the PR description, … At first I came up with removing …
Thank you for the kind explanation and the very good break-down of the feedback. I think your proposal and implementation, with the recent updates, make perfect sense. Please keep up the good work and ping me back for the final review once you think it's ready!
@bzz Thank you for saying so! Then I'll continue my work here and let you know :)
bin/common.sh
Outdated
# Text encoding for …
function downloadSparkBinary() {
  if [[ -z "${SPARK_HOME}" ]]; then
One possible issue:
`SPARK_HOME` can be defined not only in `zeppelin-env.sh` but also on the GUI interpreter setting page (a related bug is being handled in ZEPPELIN-1334).
If `SPARK_HOME` is not defined in `zeppelin-env.sh` but on the interpreter setting page in the GUI, this script will not recognize `SPARK_HOME` at this point, because `downloadSparkBinary()` is invoked by either `bin/zeppelin-daemon.sh` or `bin/zeppelin.sh`, while `SPARK_HOME` defined from the GUI is propagated to `bin/interpreter.sh` only.
Another possible issue:
Some users may not even have the Spark interpreter installed (e.g. a netinst package where the user installed the JDBC interpreter only). In this case, downloading the Spark binary just because `SPARK_HOME` is not defined doesn't make any sense.
Considering these two possible issues around this conditional statement, I suggest changing this condition to check for the installation of the Spark interpreter, not for `SPARK_HOME`. For example:
if [[ -d "${ZEPPELIN_HOME}/interpreter/spark" ]]; then
And inside the `bin/download-spark.sh` script, how about just asking the user explicitly whether they want to download the Spark binary for local mode or not? If the user answers 'N', we can create a small text file somewhere under `interpreter/spark` to remember the answer.
What do you think?
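A rough sketch of that suggestion in shell, assuming a hypothetical marker-file name and prompt wording (and that `downloadSparkBinary` is the helper from `bin/common.sh` above):

```bash
# Illustrative only: check for the spark interpreter, then ask once and
# remember a 'no' answer in a small marker file.
if [[ -d "${ZEPPELIN_HOME}/interpreter/spark" ]]; then
  ANSWER_FILE="${ZEPPELIN_HOME}/interpreter/spark/.local-spark-answer"  # hypothetical name
  if [[ ! -f "${ANSWER_FILE}" ]]; then
    read -r -p "Download the Spark binary for local mode? (y/N) " answer
    if [[ "${answer}" =~ ^[Yy]$ ]]; then
      downloadSparkBinary
    else
      echo "no" > "${ANSWER_FILE}"  # don't ask again on the next start
    fi
  fi
fi
```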
@Leemoonsoo Thank you for pointing out the possible issues! I actually didn't know about those parts. Checking the existence of `${ZEPPELIN_HOME}/interpreter/spark` makes more sense than checking `SPARK_HOME`, and so does getting the user's answer. Will update those two parts :)
@bzz @Leemoonsoo Here is the list of changes after my initial commits: …
And as @bzz said before, …
Do we need to provide this local Spark mode for them? That's actually my question.. :D
Leemoonsoo left a comment
Tested this branch and it works well.
Previously, the Spark interpreter worked in local mode with zero configuration, i.e. even without `conf/zeppelin-env.sh`.
This change requires `SPARK_HOME` for the Spark interpreter to work, so there is no more zero configuration for the Spark interpreter, although the shell script generates `conf/zeppelin-env.sh` and exports `SPARK_HOME`.
@AhyoungRyu What do you think, will there be any way to make the Spark interpreter work with zero configuration?
bin/download-spark.sh
Outdated
function save_local_spark() {
  local answer
  …
  echo "There is no local Spark binary in ${ZEPPELIN_HOME}/${SPARK_CACHE}"
I think users might not know what 'local Spark' is and why they need it.
How about explaining it as 'to use the Spark interpreter in local mode (without an external Spark installation), a Spark binary needs to be downloaded', or in some other way that gives the user enough information to understand and make a decision?
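For example, the prompt could read something like this (the wording is just a suggestion, not what the PR ships):

```bash
echo "To use the Spark interpreter in local mode (without an external Spark installation),"
echo "a Spark binary (~100MB) needs to be downloaded."
read -r -p "Download it now? (y/N) " answer
```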
bin/download-spark.sh
Outdated
function set_spark_home() {
  local line_num
  check_zeppelin_env
After `check_zeppelin_env`, if `conf/zeppelin-env.sh` already exists, which means the user explicitly created it, then I think we shouldn't change it.
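A minimal sketch of such a guard, assuming the script otherwise generates the file from the bundled template (exact file handling in the PR may differ):

```bash
# Don't touch a user-created config; only generate one if it doesn't exist yet.
if [[ -f "${ZEPPELIN_HOME}/conf/zeppelin-env.sh" ]]; then
  echo "conf/zeppelin-env.sh already exists; please set SPARK_HOME there manually."
else
  cp "${ZEPPELIN_HOME}/conf/zeppelin-env.sh.template" "${ZEPPELIN_HOME}/conf/zeppelin-env.sh"
  echo "export SPARK_HOME=\"${ZEPPELIN_HOME}/local-spark/spark-2.0.1-hadoop2.7\"" \
    >> "${ZEPPELIN_HOME}/conf/zeppelin-env.sh"
fi
```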
@Leemoonsoo Thanks for your quick feedback!
  ZEPPELIN_INTP_CLASSPATH+=":${HADOOP_CONF_DIR}"
fi
…
export SPARK_CLASSPATH+=":${ZEPPELIN_INTP_CLASSPATH}"
Any special reason for removing the above code blocks?
@Leemoonsoo
There are two types of code blocks above: one exports `HADOOP_CONF_DIR` and the other `SPARK_CLASSPATH`. As you know, those two are for older Spark version support.
- Regarding the setting: `HADOOP_CONF_DIR` is not a problem for this change since it's wrapped in the if statement (maybe we can remove it in another PR as refactoring). Anyway, I reverted this part in f5dcd04e.
- But `SPARK_CLASSPATH` can conflict with the `SPARK_SUBMIT` that I used here. `SPARK_CLASSPATH` is deprecated after spark-1.x and `SPARK_SUBMIT` is recommended instead.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.0.0
/_/
Using Scala version 2.10.4 (IBM J9 VM, Java 1.7.0)
WARN spark.SparkConf:
SPARK_CLASSPATH was detected (set to 'path-to-proprietary-hadoop-lib/*:/path-to-proprietary-hadoop-lib/lib/*').
This is deprecated in Spark 1.0+.
Please instead use:
- ./spark-submit with --driver-class-path to augment the driver classpath
- spark.executor.extraClassPath to augment the executor classpath
Please see this mail thread for more information about this.
Since I set the local Spark version to spark-2.0.0, we need to use only one of the two: `SPARK_CLASSPATH` or `SPARK_SUBMIT`. That's why I removed `SPARK_CLASSPATH` and set `SPARK_SUBMIT`.
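In other words, instead of exporting `SPARK_CLASSPATH`, the extra classpath is passed through `spark-submit` options, roughly like this (the trailing launch arguments are illustrative, not the PR's actual invocation):

```bash
# Deprecated since Spark 1.0:
#   export SPARK_CLASSPATH+=":${ZEPPELIN_INTP_CLASSPATH}"
# Replacement suggested by the warning above: pass the classpath via spark-submit.
"${SPARK_HOME}/bin/spark-submit" \
  --driver-class-path "${ZEPPELIN_INTP_CLASSPATH}" \
  --conf "spark.executor.extraClassPath=${ZEPPELIN_INTP_CLASSPATH}" \
  "${INTERPRETER_JAR}"  # hypothetical: the interpreter's jar and its arguments
```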
CI is green now! I think this PR is working as expected (at least for me, haha). So it's ready for review again.
I think ZEPPELIN-1101 can also be resolved by this change.
@jongyoul As you replied like that in ZEPPELIN-1101, could you please take a look at this one? :)
ping 👯
@AhyoungRyu Thanks for your effort. LGTM. But I think it would be better to support a non-interactive mode for running the server, because some users launch Zeppelin as a start-up service on their servers, and interactive mode would break that.
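One possible way to support that, sketched under the assumption of a TTY check or a hypothetical flag (neither is from this PR):

```bash
# Skip the interactive prompt when stdin is not a terminal (e.g. when Zeppelin
# is launched as a start-up service), or when a non-interactive flag is passed.
if [[ ! -t 0 || "${1:-}" == "--non-interactive" ]]; then
  answer="n"  # default to not downloading when nobody can answer the prompt
else
  read -r -p "Download the Spark binary for local mode? (y/N) " answer
fi
```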
@jongyoul Thanks for your feedback! Yeah, I didn't try to cover that case. So you mean we need to support people who are using this upstart option, am I right? :)
@AhyoungRyu Thanks for taking care of my feedback 😄
CI is green now, so it's ready for review. @bzz Could you take a look at this again? Maybe this message can be removed in the future, once Zeppelin users have become accustomed to this change.
ping 💃
docs/install/upgrade.md
Outdated
- From 0.7, we don't use `ZEPPELIN_JAVA_OPTS` as default value of `ZEPPELIN_INTP_JAVA_OPTS` and also the same for `ZEPPELIN_MEM`/`ZEPPELIN_INTP_MEM`. If user want to configure the jvm opts of interpreter process, please set `ZEPPELIN_INTP_JAVA_OPTS` and `ZEPPELIN_INTP_MEM` explicitly. If you don't set `ZEPPELIN_INTP_MEM`, Zeppelin will set it to `-Xms1024m -Xmx1024m -XX:MaxPermSize=512m` by default.
- From 0.7, the support on Spark 1.1.x to 1.3.x is deprecated.
- Zeppelin embedded Spark won't work anymore. You need to run `./bin/zeppelin-daemon.sh get-spark` or `./bin/zeppelin.sh get-spark` at least one time. Please see [local Spark mode](../interpreter/spark.html#local-spark-mode) for more detailed information.
As this looks like a substantial change, maybe we could use a bit stronger language here, i.e.:
Apache Zeppelin releases do not come with Apache Spark built-in by default any more.
In order to be able to run Apache Spark paragraphs, please either run `./bin/zeppelin-daemon.sh get-spark` or point `$SPARK_HOME` to an Apache Spark installation.
See .... for more details ..
What do you think?
Thank you @AhyoungRyu for the great job and for taking care in addressing the user-experience concerns!
Let me also review this great PR and then give some feedback 👍
In case of …, the user may not be interested in local Spark, but will keep seeing these messages … @AhyoungRyu What do you think?
Short summary and small thought about #1339
I'm closing this PR since there'll be a better solution for this (e.g. a mechanism similar to ZEPPELIN-1993) :)

What is this PR for?
Currently, Zeppelin's embedded Spark is located under `interpreter/spark/`. For those who build Zeppelin from source, this Spark is downloaded when they build the source with build profiles. I think these various build profiles are useful for customizing the embedded Spark, but many Spark users use their own Spark, not Zeppelin's embedded one. Nowadays only Spark & Zeppelin beginners use this embedded Spark, and for them there are too many build profiles (it's quite complicated, I think).
In the case of the Zeppelin binary package, the embedded Spark is included by default under `interpreter/spark/`. That's why the Zeppelin package size is so huge.
New suggestions
This PR will change the embedded Spark binary downloading mechanism as below:
- Run `./bin/zeppelin-daemon.sh get-spark` or `./bin/zeppelin.sh get-spark`
- It creates `ZEPPELIN_HOME/local-spark/`, downloads `spark-2.0.1-hadoop2.7.bin.tgz`, and untars it (… `SPARK_HOME`)
What type of PR is it?
Improvement
Todos
- Handle `ctrl+c` & `ctrl+z` key interruption during downloading Spark
What is the Jira issue?
ZEPPELIN-1332
How should this be tested?
- `rm -r spark-dependencies`
- `mvn clean package -DskipTests`
- `bin/zeppelin-daemon.sh get-spark` or `bin/zeppelin.sh get-spark`
- Run `sc.version` without setting an external `SPARK_HOME`
Screenshots (if appropriate)
- `./bin/zeppelin-daemon.sh get-spark`
- When `ZEPPELIN_HOME/local-spark/spark-2.0.1-hadoop2.7` already exists
Questions: