Conversation

@AhyoungRyu (Contributor) commented Aug 18, 2016

What is this PR for?

Currently, Zeppelin's embedded Spark lives under interpreter/spark/.
For people who build Zeppelin from source, this Spark is downloaded during the build according to the chosen build profiles. The various build profiles are useful for customizing the embedded Spark, but many Spark users use their own Spark installation rather than Zeppelin's embedded one. Nowadays mostly Spark & Zeppelin beginners use the embedded Spark, and for them the number of build profiles is overwhelming (it's quite complicated, I think).
In the Zeppelin binary package, the embedded Spark is included by default under interpreter/spark/. That is why the Zeppelin package is so large.

New suggestions

This PR changes the embedded Spark binary downloading mechanism as follows.

  1. Run ./bin/zeppelin-daemon.sh get-spark or ./bin/zeppelin.sh get-spark
  2. It creates ZEPPELIN_HOME/local-spark/, downloads spark-2.0.1-bin-hadoop2.7.tgz, and untars it (see the sketch after this list)
  3. This local Spark can then be used without any configuration (e.g. setting SPARK_HOME), just like before
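A minimal sketch of what such a get-spark helper could look like (the variable names and the mirror URL below are illustrative assumptions, not the exact script in this PR):

# Sketch only: download a pre-built Spark binary into ZEPPELIN_HOME/local-spark and unpack it.
SPARK_VERSION="2.0.1"
HADOOP_VERSION="2.7"
SPARK_CACHE="local-spark"
SPARK_ARCHIVE="spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}"

mkdir -p "${ZEPPELIN_HOME}/${SPARK_CACHE}"
if [[ ! -d "${ZEPPELIN_HOME}/${SPARK_CACHE}/${SPARK_ARCHIVE}" ]]; then
  # fetch the archive from an Apache mirror (URL assumed for illustration) and untar it
  curl -L -o "${ZEPPELIN_HOME}/${SPARK_CACHE}/${SPARK_ARCHIVE}.tgz" \
    "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_ARCHIVE}.tgz"
  tar -xzf "${ZEPPELIN_HOME}/${SPARK_CACHE}/${SPARK_ARCHIVE}.tgz" -C "${ZEPPELIN_HOME}/${SPARK_CACHE}"
fi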

What type of PR is it?

Improvement

Todos

  • trap Ctrl+C & Ctrl+Z interruptions while downloading Spark (see the sketch after this list)
  • test on different operating systems
  • update the related documentation pages again after getting feedback
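For the interruption-trap item, a minimal sketch of a cleanup handler (the function name and the variables reused from the sketch above are illustrative assumptions):

# Sketch only: remove a partially downloaded archive if the user interrupts the download.
cleanup_partial_download() {
  echo "Download interrupted, removing partial archive ..."
  rm -f "${ZEPPELIN_HOME}/${SPARK_CACHE}/${SPARK_ARCHIVE}.tgz"
  exit 1
}
trap cleanup_partial_download INT TSTP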

What is the Jira issue?

ZEPPELIN-1332

How should this be tested?

  1. rm -r spark-dependencies
  2. Apply this patch and build with mvn clean package -DskipTests
  3. try bin/zeppelin-daemon.sh get-spark or bin/zeppelin.sh get-spark
  4. sc.version should run without setting an external SPARK_HOME

Screenshots (if appropriate)

  • ./bin/zeppelin-daemon.sh get-spark
$ ./bin/zeppelin-daemon.sh get-spark
Download spark-2.0.1-bin-hadoop2.7.tgz from mirror ...

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  178M  100  178M    0     0  10.4M      0  0:00:17  0:00:17 --:--:-- 10.2M

spark-2.0.1-bin-hadoop2.7 is successfully downloaded and saved under /Users/ahyoungryu/Dev/zeppelin-development/zeppelin/local-spark
  • if ZEPPELIN_HOME/local-spark/spark-2.0.1-bin-hadoop2.7 already exists
$ ./bin/zeppelin-daemon.sh get-spark
spark-2.0.1-bin-hadoop2.7 already exists under local-spark.

Questions:

  • Do the license files need to be updated? No
  • Are there breaking changes for older versions? No
  • Does this need documentation? Yes, some related documents need to be updated (e.g. README.md, spark.md and install.md?)

@AhyoungRyu AhyoungRyu closed this Aug 18, 2016
@AhyoungRyu AhyoungRyu reopened this Aug 18, 2016
@bzz (Member) commented Aug 19, 2016

@AhyoungRyu great initiative, but while making these changes you also have to think about the CI use case of the Zeppelin build.

I.e. so far /.spark-dist/ is cached on TravisCI, which is an S3 bucket that gets synced automatically with the content of this folder while running a build. If you un-tar the whole archive there, it will take forever to sync with S3 and will defeat the purpose of the cache on the CI side, making build times longer.

If you ask me, I would say that before doing a change as big as refactoring the build structure, we all need a very clear understanding and explanation of what the benefit is and what problem this change solves.

So far I have not understood the answer to the questions above from the PR description (maybe my fault). But in case of voting on such a change, that would make me at least -0 for it, if not -1, due to the potential bugs such changes can bring.

If it is about reducing the convenience binary size, then we need to know how much the size changes with the proposed changes to understand whether it is worth it. If it impacts CI build times, we also need to know by how much.

Also, regarding user experience: when running zeppelin-daemon.sh, a user does not usually expect it to be network-dependent and to download 100MB archives. Is there at least a user notification/progress indicator? Otherwise there are going to be bug reports like "Zeppelin is not starting" as soon as such a change is introduced.
And how about Windows users of Zeppelin? How about EMR/Dataproc/Juju/BigTop users, will the proposed change affect them?

Please take this with a grain of salt, and of course I will be happy to help address each item one by one.

@AhyoungRyu (Contributor, Author) commented Aug 20, 2016

@bzz Thank you for such a precise comment! Let me break down your feedback one by one (just to make it clear) :)

/.spark-dist/ is under cache on TravisCI which is S3 bucket that gets synced automatically with the content of this folder while running a build.

Right, that's my bad. I'll change the directory to another one. Then how about ZEPPELIN_HOME/interpreter/spark/, like before?

2, 3, 4.

what is the benefit and what problem does this change solve?

Actually I tried to describe the current problem & the advantages of this change in the Jira issue and the PR description, but I guess I didn't do it well. I should have explained it more clearly. Let me explain more here with actual numbers. (I'll update the Jira & PR description as well.)

  • What was the problem?

As you said above, yes, the main problem is the Zeppelin binary package size. The latest Zeppelin binary sizes were

zeppelin-0.6.1-bin-all.tgz: 517MB
zeppelin-0.6.1-bin-netinst.tgz: 236MB

Didn't we have to ask the ASF infra team(?) every release because of Zeppelin's huge package size?

  • What is the benefit?

When I created the binary packages without spark-dependencies, each package size was

zeppelin-0.6.1-bin-all.tgz: 344MB
zeppelin-0.6.1-bin-netinst.tgz: 64MB

As you can see above, the size difference between those two cases is about 170MB! Moreover, users don't need to type build profiles such as -Pr or -Psparkr. I saw many users trying to use %sparkr in Zeppelin hit an NPE because they didn't build with -Psparkr. It's truly confusing; maybe they don't know the Maven build mechanism well. But with this change, they don't need to know about the complicated Maven build profiles.

Also, regarding user experience: when running zeppelin-daemon.sh, a user does not usually expect it to be network-dependent and to download 100MB archives. Is there at least a user notification/progress indicator?

So far, I have just added the line below, which is printed to the console after users start zeppelin-daemon.sh:

echo "There is no SPARK_HOME in your system. After successful Spark bin installation, Zeppelin will be started."

Then it starts downloading the Spark binary from the mirror site. I'm planning to add a description to the README, since we already provide a lot of build profile information there. I also agree there must be a better way to notify users than just writing "We will download a 100MB Spark binary package if you don't set SPARK_HOME yet" in the README.
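A sketch of the startup hook being described, reusing the message above (downloadSparkBinary comes from the diff under review; the surrounding logic is an assumption):

# Sketch only: if SPARK_HOME is not set, notify the user and download Spark before starting Zeppelin.
if [[ -z "${SPARK_HOME}" ]]; then
  echo "There is no SPARK_HOME in your system. After successful Spark bin installation, Zeppelin will be started."
  downloadSparkBinary   # fetches the Spark binary from a mirror into the local cache
fi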

how about Windows users of Zeppelin? How about EMR/Dataproc/Juju/BigTop users, will the proposed change affect them?

As you can see in the PR description TODO, I'm planning to create a download-spark.cmd file for Windows users. Regarding EMR/Dataproc/Juju/BigTop users, I didn't quite catch that actually, so I think I need more time to figure that out.

After I first came up with removing spark-dependencies to reduce the Zeppelin binary package size, I spent a long time thinking about how we can seamlessly replace the preexisting way of providing embedded Spark in Zeppelin. Please regard this PR as a first initiative, and I'd appreciate it if you could share your ideas about this issue! :)

@bzz (Member) commented Aug 26, 2016

Thank you for the kind explanation and the very good break-down of the feedback. I think your proposal and implementation, with the recent updates, make perfect sense.

Please keep up the good work and ping me back for the final review once you think it's ready!

@AhyoungRyu (Contributor, Author)

@bzz Thank you for saying so! Then I'll continue my work here and let you know :)

bin/common.sh Outdated

# Text encoding for
function downloadSparkBinary() {
if [[ -z "${SPARK_HOME}" ]]; then
Member

One possible issue is that SPARK_HOME can be defined not only in zeppelin-env.sh but also on the GUI interpreter setting page (a related bug is being handled by ZEPPELIN-1334).

If SPARK_HOME is not defined in zeppelin-env.sh but on the interpreter setting page in the GUI, this script will not recognize SPARK_HOME at this point, because downloadSparkBinary() is invoked by either bin/zeppelin-daemon.sh or bin/zeppelin.sh, while SPARK_HOME defined from the GUI is propagated to bin/interpreter.sh only.

Another possible issue is that some users may not even have the Spark interpreter installed (e.g. the netinst package where the user installed only the JDBC interpreter). In this case, downloading the Spark binary because SPARK_HOME is not defined doesn't make any sense.

Considering these two possible issues around this conditional statement, I suggest changing this condition to check for the installation of the Spark interpreter, not for SPARK_HOME. For example:

if [[ -d "${ZEPPELIN_HOME}/interpreter/spark" ]]; then

And inside the bin/download-spark.sh script, how about just asking the user explicitly whether they want to download the Spark binary for local mode or not? If the user answers 'N', we can create a small text file somewhere under interpreter/spark to remember the answer.

What do you think?
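A minimal sketch of the flow being suggested here (the marker-file name and the prompt wording are assumptions, not the final script):

# Sketch only: ask once whether to download Spark for local mode, and remember a 'no' answer.
if [[ -d "${ZEPPELIN_HOME}/interpreter/spark" && -z "${SPARK_HOME}" ]]; then
  if [[ ! -f "${ZEPPELIN_HOME}/interpreter/spark/.no-local-spark" ]]; then
    read -p "Do you want to download a Spark binary for local mode? (y/n) " answer
    if [[ "${answer}" == "y" ]]; then
      downloadSparkBinary
    else
      # remember the answer so the user is not asked again on the next start
      touch "${ZEPPELIN_HOME}/interpreter/spark/.no-local-spark"
    fi
  fi
fi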

Contributor Author

@Leemoonsoo Thank you for pointing out the possible issues! I actually didn't know about those parts. Checking the existence of ${ZEPPELIN_HOME}/interpreter/spark makes more sense than checking SPARK_HOME, and getting the user's answer does as well. Will update those two parts :)

@AhyoungRyu AhyoungRyu force-pushed the ZEPPELIN-1332 branch 5 times, most recently from 2302e78 to 7c33f17 on September 12, 2016 18:58
@AhyoungRyu AhyoungRyu changed the title [WIP][ZEPPELIN-1332] Remove spark-dependencies & suggest new way [ZEPPELIN-1332] Remove spark-dependencies & suggest new way Sep 14, 2016
@AhyoungRyu (Contributor, Author) commented Sep 14, 2016

@bzz @Leemoonsoo
Sorry for my late response. I spent some time testing various cases on different OSes.
I think it's ready for review (CI is green at last!).
I updated the PR description accordingly. I hope it helps remind you of the purpose of this PR :)

Here is the list of changes since my initial commits:

  • Directory of the Spark binary
    I changed the directory for the Spark binary from interpreter/spark/ to local-spark/. Since mvn clean removes interpreter/, users would see “Do you want to download local Spark?” whenever they re-build and restart Zeppelin. So I think creating a new directory (local-spark) is better in this case.
  • Supporting Windows users
    I wanted to create download-spark.cmd for Windows, but sadly we can’t use many shell commands such as curl, tar and sed in a batch script, and it was hard to find 100% compatible commands for Windows. Maybe we could guide Windows users to install those commands themselves, but I think that would be a bit much. The download-spark script is really only for downloading the latest version of Spark and setting SPARK_HOME, so I updated some docs to explain this as an alternative.
  • Documentation
    I think it’s quite a big change that people need to enter “Yes/No” when they start Zeppelin, even though it’s only one time. So I updated README.md, install.md and spark.md. For example, the screenshot below is from spark.md.
    [screenshot: spark.md documentation page, 2016-09-23]

And as @bzz said before,

How about EMR/Dataproc/Juju/BigTop users, will the proposed change affect them?

Do we need to provide this local Spark mode for them? That's actually my question.. :D
Please feel free to point out anything to me if needed.

@Leemoonsoo (Member) left a comment

Tested this branch and it works well.

Previously, the Spark interpreter worked in local mode with zero configuration, i.e. even without conf/zeppelin-env.sh.

This change requires SPARK_HOME to make the Spark interpreter work, so there is no longer zero configuration for the Spark interpreter, although the shell script generates conf/zeppelin-env.sh and exports SPARK_HOME.

@AhyoungRyu What do you think, is there any way to make the Spark interpreter work with zero configuration?

function save_local_spark() {
local answer

echo "There is no local Spark binary in ${ZEPPELIN_HOME}/${SPARK_CACHE}"
Member

I think the user might not know what 'local Spark' is and why they need it.
How about explaining it as 'to use the Spark interpreter in local mode (without an external Spark installation), a Spark binary needs to be downloaded', or some such way that gives the user enough information to understand and make a decision?
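For illustration, the reworded prompt could look roughly like this (the exact wording is only a suggestion, not final text):

# Sketch only: explain why the download is needed before asking for a decision.
echo "To use the Spark interpreter in local mode (without an external Spark installation),"
echo "a Spark binary needs to be downloaded into ${ZEPPELIN_HOME}/${SPARK_CACHE}."
read -p "Do you want to download it now? (y/n) " answer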


function set_spark_home() {
local line_num
check_zeppelin_env
Member

After check_zeppelin_env, if conf/zeppelin-env.sh already exists, which means the user explicitly created it, then I think we shouldn't change it.
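A minimal sketch of the guard being suggested (the template path and the generated SPARK_HOME line are assumptions for illustration):

# Sketch only: leave a user-created zeppelin-env.sh untouched; only generate one if it is missing.
function set_spark_home() {
  if [[ -f "${ZEPPELIN_HOME}/conf/zeppelin-env.sh" ]]; then
    # the user created this file explicitly; do not modify it
    return 0
  fi
  cp "${ZEPPELIN_HOME}/conf/zeppelin-env.sh.template" "${ZEPPELIN_HOME}/conf/zeppelin-env.sh"
  echo "export SPARK_HOME=\"${ZEPPELIN_HOME}/${SPARK_CACHE}/${SPARK_ARCHIVE}\"" >> "${ZEPPELIN_HOME}/conf/zeppelin-env.sh"
}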

@AhyoungRyu (Contributor, Author) commented Sep 17, 2016

@Leemoonsoo Thanks for your quick feedback!
The "zero configuration like before" point makes sense. Let me update it and I will ping you again.

ZEPPELIN_INTP_CLASSPATH+=":${HADOOP_CONF_DIR}"
fi

export SPARK_CLASSPATH+=":${ZEPPELIN_INTP_CLASSPATH}"
Member

Any special reason for removing the above code blocks?

@AhyoungRyu (Contributor, Author) commented Sep 22, 2016

@Leemoonsoo
There are two types of code blocks above. One is for exporting HADOOP_CONF_DIR and the other is for SPARK_CLASSPATH. As you know, those two are for older Spark version support.

  • Setting HADOOP_CONF_DIR is not a problem for this change since it's wrapped in the if statement (maybe we can remove this in another PR as refactoring). Anyway, I reverted this part in f5dcd04e.
  • But SPARK_CLASSPATH can conflict with SPARK_SUBMIT, which I used here. SPARK_CLASSPATH is deprecated since Spark 1.0 and using spark-submit (SPARK_SUBMIT) is recommended instead.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.0.0
      /_/

Using Scala version 2.10.4 (IBM J9 VM, Java 1.7.0)
WARN spark.SparkConf: 
SPARK_CLASSPATH was detected (set to 'path-to-proprietary-hadoop-lib/*:/path-to-proprietary-hadoop-lib/lib/*').
This is deprecated in Spark 1.0+.

Please instead use:
 - ./spark-submit with --driver-class-path to augment the driver classpath
 - spark.executor.extraClassPath to augment the executor classpath

Please see this mail thread for more information about this.
Since I set the local Spark version to spark-2.0.0, we need to use only one of SPARK_CLASSPATH or SPARK_SUBMIT. That's why I removed SPARK_CLASSPATH and set SPARK_SUBMIT.
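For reference, a minimal sketch of the spark-submit-style launch the warning recommends (the jar variable, port variable, and interpreter class below are illustrative assumptions, not the exact command used by Zeppelin):

# Sketch only: pass the classpath via spark-submit flags instead of the deprecated SPARK_CLASSPATH.
"${SPARK_HOME}/bin/spark-submit" \
  --driver-class-path "${ZEPPELIN_INTP_CLASSPATH}" \
  --conf "spark.executor.extraClassPath=${ZEPPELIN_INTP_CLASSPATH}" \
  --class org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer \
  "${SPARK_APP_JAR}" "${PORT}"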

@AhyoungRyu (Contributor, Author) commented Sep 22, 2016

CI is green now! I think this PR is working as expected (at least for me, haha). So it's ready for review again.
@moon If possible, could you please check this one again? :)

@AhyoungRyu (Contributor, Author)

I think ZEPPELIN-1101 can also be resolved by this change.

It looks related to ZEPPELIN-1099, which is about removing dependencies from Spark. I think we don't need to build spark-dependencies ourselves; we'd better provide a script to download the Spark binary and set SPARK_HOME. How about it?

@jongyoul As you replied above in ZEPPELIN-1101, could you please take a look at this one? :)

@AhyoungRyu (Contributor, Author)

ping 👯

@jongyoul (Member)

@AhyoungRyu Thanks for your effort. LGTM. But I think it would be better to support a non-interactive mode for running the server, because some users launch Zeppelin as a start-up service on their servers and an interactive mode would break this feature.
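A minimal sketch of one way to keep startup non-interactive (the tty check and the default answer are assumptions, not the PR's implementation):

# Sketch only: skip the interactive question when there is no controlling terminal,
# e.g. when Zeppelin is launched as a system start-up service.
if [[ ! -t 0 ]]; then
  answer="n"
else
  read -p "Do you want to download a Spark binary for local mode? (y/n) " answer
fi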

@AhyoungRyu (Contributor, Author)

@jongyoul Thanks for your feedback! Yeah, I didn't try to cover that case. So you mean we need to support people who are using this upstart option, am I right? :)

@tae-jun (Contributor) commented Nov 18, 2016

@AhyoungRyu Thanks for taking care of my feedback 😄

@AhyoungRyu (Contributor, Author)

CI is green now, so it's ready for review.
I updated the related docs again based on #1615 and @tae-jun's feedback as well.

@bzz Could you take a look at this again?
As I mentioned in this comment, I added "You do not have neither local-spark, nor external SPARK_HOME set up.\nIf you want to use Spark interpreter, you need to run get-spark at least one time or set SPARK_HOME." This message will be printed when the user starts Zeppelin if they have neither local-spark/ nor an external SPARK_HOME set on their machine. Please see my latest commit :)

Maybe this message can be removed in the future, once Zeppelin users have become accustomed to this change.

@AhyoungRyu (Contributor, Author)

ping 💃


- From 0.7, we don't use `ZEPPELIN_JAVA_OPTS` as default value of `ZEPPELIN_INTP_JAVA_OPTS` and also the same for `ZEPPELIN_MEM`/`ZEPPELIN_INTP_MEM`. If user want to configure the jvm opts of interpreter process, please set `ZEPPELIN_INTP_JAVA_OPTS` and `ZEPPELIN_INTP_MEM` explicitly. If you don't set `ZEPPELIN_INTP_MEM`, Zeppelin will set it to `-Xms1024m -Xmx1024m -XX:MaxPermSize=512m` by default.
- From 0.7, the support on Spark 1.1.x to 1.3.x is deprecated.
- Zeppelin embedded Spark won't work anymore. You need to run `./bin/zeppelin-daemon.sh get-spark` or `./bin/zeppelin.sh get-spark` at least one time. Please see [local Spark mode](../interpreter/spark.html#local-spark-mode) for more detailed information.
@bzz (Member) commented Nov 24, 2016

As this looks like a substantial change, maybe we could use a bit stronger language here, i.e.

Apache Zeppelin releases do not come with Apache Spark built-in by default any more. 
In order to be able to run Apache Spark paragraphs, please either run `./bin/zeppelin-daemon.sh get-spark` or point `$SPARK_HOME` to an Apache Spark installation.
See .... for more details ..

What do you think?

@bzz (Member) commented Nov 24, 2016

Thank you @AhyoungRyu for the great job and for taking care to address the user experience concerns!

@1ambda (Member) commented Nov 24, 2016

Let me also review this great PR and then give some feedback 👍

@AhyoungRyu (Contributor, Author)

@bzz Just updated upgrade.md per your feedback.
@1ambda Sure. Thanks! Please do :)

@AhyoungRyu AhyoungRyu closed this Nov 24, 2016
@AhyoungRyu AhyoungRyu reopened this Nov 24, 2016
@Leemoonsoo (Member)

In the case where a user

  1. doesn't plan to use the Spark interpreter and just wants to use other interpreters like Python or BigQuery, or
  2. sets SPARK_HOME in an interpreter property instead of conf/zeppelin-env.sh,

the user may not be interested in local-spark, but will keep seeing these messages:

Lees-MacBook:pr1339 moon$ bin/zeppelin-daemon.sh start

You do not have neither local-spark, nor external SPARK_HOME set up.
If you want to use Spark interpreter, you need to run get-spark at least one time or set SPARK_HOME.

Zeppelin start                                             [  OK  ]
Lees-MacBook:pr1339 moon$ bin/zeppelin-daemon.sh stop
Zeppelin stop                                              [  OK  ]
Lees-MacBook:pr1339 moon$ bin/zeppelin-daemon.sh start

You do not have neither local-spark, nor external SPARK_HOME set up.
If you want to use Spark interpreter, you need to run get-spark at least one time or set SPARK_HOME.

Zeppelin start                                             [  OK  ]

@AhyoungRyu What do you think?

@1ambda (Member) commented Nov 29, 2016

A short summary and a small thought about #1339:

  1. Using a symlink like local-spark/master would be safe, I think. It lets users swap in their own local Spark without renaming directories. Currently we are using a hard-coded name, and it will be different for each binary version (see the sketch at the end of this comment).
SPARK_CACHE="local-spark"
SPARK_ARCHIVE="spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}"
  2. About UX: now users need to type get-spark. It works as described:
$  zeppelin-review git:(pr/1339) ./bin/zeppelin-daemon.sh start
Log dir doesn't exist, create /Users/1ambda/github/apache-zeppelin/zeppelin-review/logs
Pid dir doesn't exist, create /Users/1ambda/github/apache-zeppelin/zeppelin-review/run

You do not have neither local-spark, nor external SPARK_HOME set up.
If you want to use Spark interpreter, you need to run get-spark at least one time or set SPARK_HOME.

Zeppelin start                                             [  OK  ]
$  zeppelin-review git:(pr/1339) ./bin/zeppelin-daemon.sh stop
Zeppelin stop                                              [  OK  ]
$  zeppelin-review git:(pr/1339) ./bin/zeppelin-daemon.sh get-spark
Download spark-2.0.1-bin-hadoop2.7.tgz from mirror ...

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  178M  100  178M    0     0  7157k      0  0:00:25  0:00:25 --:--:-- 6953k

spark-2.0.1-bin-hadoop2.7 is successfully downloaded and saved under /Users/lambda/github/apache-zeppelin/zeppelin-review/local-spark

$  zeppelin-review git:(pr/1339) ./bin/zeppelin-daemon.sh start
Zeppelin start                                             [  OK  ]
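Regarding the symlink idea in point 1, a minimal sketch (the downloaded directory name matches the logs above; local-spark/master follows the naming suggested there):

# Sketch only: point a stable symlink at whichever Spark version was downloaded,
# so SPARK_HOME never has to change when the Spark version does.
ln -sfn "${ZEPPELIN_HOME}/local-spark/spark-2.0.1-bin-hadoop2.7" "${ZEPPELIN_HOME}/local-spark/master"
export SPARK_HOME="${ZEPPELIN_HOME}/local-spark/master"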

@AhyoungRyu (Contributor, Author) commented Mar 22, 2017

I'm closing this PR since there will be a better solution for this (e.g. a mechanism similar to ZEPPELIN-1993) :)

@AhyoungRyu AhyoungRyu closed this Mar 22, 2017