Adding a working Docker setup for developing sparkmagic #361
Conversation
It includes the Jupyter notebook as well as the Livy+Spark endpoint. Documentation is in the README.
Now you can just launch a PySpark wrapper kernel and have it work out of the box.
This is fantastic to see! 🙇
I think it's pretty much ready except for a few minor comments. I'm trying it out now, but wanted to give some feedback ASAP. I'll add more comments if I find anything once I try it with Docker.
ENV LIVY_BUILD_VERSION livy-server-0.3.0
ENV LIVY_APP_PATH /apps/$LIVY_BUILD_VERSION
ENV LIVY_BUILD_PATH /apps/build/livy
We need to create a Python 3 environment and set the PYSPARK3_PYTHON variable as explained in https://github.com/cloudera/livy/blob/511a05f2282cd85a457017cc5a739672aaed5238/README.rst#pyspark3
I'd recommend installing Anaconda to create the environment.
I guess it would be the same for PYSPARK_PYTHON, but if it's not set, it will just use the system's Python, which is fine as long as it's a 2.7.x version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something similar to this:
ANACONDA_DEST_PATH=/usr/bin/anaconda
CONDA_PATH=$ANACONDA_DEST_PATH/bin/conda
INSTALLER_PATH=/tmp/anaconda-installer.sh   # where the installer gets downloaded (path is illustrative)
# TODO: wget the Anaconda installer to $INSTALLER_PATH
bash $INSTALLER_PATH -p $ANACONDA_DEST_PATH -b -f
# Create an env for Python 3.
$CONDA_PATH create -n py35 python=3.5 anaconda
I've made the other changes, but I'm not sure I understand the need for this one. The base image I'm using for the Livy container (gettyimages/spark:2.1.0-hadoop-2.7) already has a Python 3 installation, which seems to be working just fine with Livy and sparkmagic as-is.
Do you just mean creating a virtualenv for it? I can do that, though I'm not sure it's necessary if we're just running inside a single-purpose container.
Sorry I wasn't clear. I meant we need to install the following two kernels, in addition to the kernels already being installed:
- pysparkkernel
- sparkkernel
You can just add sparkkernel to the list of kernels to install, and it should just work. pysparkkernel is different in that it should run in a Python 2.7 environment, whereas pyspark3kernel would run in a Python 3 environment. Because the container's Python installation is Python 3, we need to tell Livy where to find the Python 2 installation in the image (created via virtualenv, Anaconda, or something else). The way Livy finds the two installations is via the env variables I mentioned above.
The point would be to show users that they can have two Python versions running side by side, and how to set it up. This is some extra work, so I would be OK with you just adding sparkkernel for now and me creating an issue to add pysparkkernel to the Docker image later.
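A hedged illustration of what that looks like in the Livy container's Dockerfile (the interpreter paths below are assumptions for illustration, not taken from this PR):
# Livy resolves the interpreter for pyspark vs. pyspark3 sessions from these variables
ENV PYSPARK_PYTHON /usr/bin/python2.7
ENV PYSPARK3_PYTHON /usr/bin/anaconda/envs/py35/bin/python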
Ah, I see what you mean then :) Okay, let me add that real quick.
Dockerfile.jupyter
COPY sparkmagic/example_config.json /home/$NB_USER/.sparkmagic/config.json
RUN sed -i 's/localhost/spark/g' /home/$NB_USER/.sparkmagic/config.json
RUN jupyter nbextension enable --py --sys-prefix widgetsnbextension
RUN jupyter-kernelspec install --user $(pip show sparkmagic | grep Location | cut -d" " -f2)/sparkmagic/kernels/pyspark3kernel
I think we should be installing:
- sparkkernel
- pysparkkernel
- pyspark3kernel
- sparkrkernel (?)
Is R properly set up for Livy on the image? I'll try it as soon as I'm done with the review
Yeah, so R is not set up properly for Livy. Let's disable installing the SparkR kernel.
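Putting this thread together, a sketch of the extra Dockerfile.jupyter lines it implies, reusing the pip show / jupyter-kernelspec pattern already in the diff (pyspark3kernel is already installed above; SparkR is omitted since it isn't set up):
RUN jupyter-kernelspec install --user $(pip show sparkmagic | grep Location | cut -d" " -f2)/sparkmagic/kernels/sparkkernel
RUN jupyter-kernelspec install --user $(pip show sparkmagic | grep Location | cut -d" " -f2)/sparkmagic/kernels/pysparkkernel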
README.md
You will then be able to access the Jupyter notebook in your browser at
http://localhost:8888. Inside this notebook, you can configure a
sparkmagic endpoint at http://spark:8998. This endpoint is able to
launch both Scala and Python sessions.
We should also mention that the managed sparkmagic kernels will be available.
All except SparkR.
Done
then simply run:

docker-compose build
docker-compose up
I think it would also be a good idea to add instructions for exiting... Ctrl-C + docker-compose down?
Done
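For reference, a minimal sketch of the shutdown flow being documented (commands as suggested above):
# Ctrl-C stops the foreground docker-compose process; then tear the containers down:
docker-compose down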
Also added an R section to example_config.json to make it work out of the box - and I think it's just a good thing to have it anyway, otherwise how would users ever know it was meant to be there?
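For illustration, an R endpoint section would follow the same shape as the existing Python and Scala credential sections in example_config.json (the key name below is an assumption based on the config's existing naming pattern):
"kernel_r_credentials": {
    "username": "",
    "password": "",
    "url": "http://localhost:8998"
}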
Added a dev_mode build-arg, disabled by default. When enabled, it builds the container using your local copy of sparkmagic, so that you can test your development changes inside the container.
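A hedged example of how such a build could be invoked (only the dev_mode arg and Dockerfile.jupyter name come from this PR; the image tag is illustrative):
# Build the Jupyter image against your local sparkmagic checkout
docker build --build-arg dev_mode=true -f Dockerfile.jupyter -t sparkmagic-jupyter .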
Thanks for the updates! If you can just add ...
Thanks for your contribution!
Added the missing kernels: Scala and Python2 were missing. Confirmed that Python2 and Python3 are indeed separate environments on the spark container.
Comments addressed :) Debian has separate packages and environments for python2 and python3, so it was just a matter of making sure both are installed. I confirmed manually that they are in fact in separate environments and that the right kernel points to the right version.
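A sketch of what that amounts to in the spark container's Dockerfile, assuming stock Debian package names (illustrative, not copied from the PR):
# python installs CPython 2.7 and python3 installs CPython 3.x; they live in separate environments
RUN apt-get update && apt-get install -y python python3 && rm -rf /var/lib/apt/lists/*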
Thanks for the awesome contribution! I know a lot of people will find it very useful.
It includes the Jupyter notebook as well as the Livy+Spark endpoint.
Documentation is in the README.
Tested and works on my own machine (which, since this is Docker, means it should work anywhere). You just
docker-compose build && docker-compose up
and then launch http://localhost:8888 and add http://spark:8998 as a Livy endpoint from %manage_spark.
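In shell form, the quick start described above:
docker-compose build && docker-compose up
# then open http://localhost:8888 and add http://spark:8998 as a Livy endpoint via %manage_spark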